Image stitching with dynamic seam placement based on object saliency for surround view visualization

ABSTRACT

In various examples, dynamic seam placement is used to position seams in regions of overlapping image data to avoid crossing salient objects or regions. Objects may be detected from image frames representing overlapping views of an environment surrounding an ego-object such as a vehicle. The images may be aligned to create an aligned composite image or surface (e.g., a panorama, a 360° image, bowl shaped surface) with regions of overlapping image data, and a representation of the detected objects and/or salient regions (e.g., a saliency mask) may be generated and projected onto the aligned composite image or surface. Seams may be positioned in the overlapping regions to avoid or minimize crossing salient pixels represented in the projected masks, and the image data may be blended at the seams to create a stitched image or surface (e.g., a stitched panorama, stitched 360° image, stitched textured surface).

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/326,724, filed on Apr. 1, 2022, the contents of which are incorporated by reference in their entirety. This application is related to U.S. Application No. TBD, entitled “Image Stitching with Dynamic Seam Placement based on Ego-Vehicle State for Surround View Visualization,” filed on Feb. 23, 2023; U.S. Application No. TBD, entitled “Image Stitching with Adaptive Three-Dimensional Bowl Model of the Surrounding Environment for Surround View Visualization,” filed on Feb. 23, 2023; U.S. Application No. TBD, entitled “Under Vehicle Reconstruction for Vehicle Environment Visualization,” filed on Feb. 23, 2023; and U.S. Application No. TBD, entitled “Optimized Visualization Streaming for Vehicle Environment Visualization,” filed on Feb. 23, 2023.

BACKGROUND

Vehicle Surround View Systems provide occupants of a vehicle with a visualization of the area surrounding the vehicle. For drivers, Surround View Systems provide the ability to view the surrounding area, including blind spots where the driver's line of sight is occluded by parts of the vehicle or other objects in the environment, without the need to reposition (e.g., turn their head, get off the driver's seat, lean in a certain direction, etc.). This visualization may assist and facilitate a variety of driving maneuvers, such as smoothly entering or exiting a parking spot without hitting vulnerable road users like pedestrians or objects such as a road curb or other vehicles. More and more vehicles, especially luxury brands or new models, are being produced equipped with Surround View Systems.

Existing vehicle Surround View Systems usually utilize fisheye cameras—typically mounted at the front, left, rear and right sides of the vehicle body—to perceive the surrounding area from multiple directions. Additional cameras may be included in special cases, like for long trucks or vehicles with trailers. Frames from the individual cameras are stitched together using camera parameters to align frames and blending techniques to combine overlapping regions to provide a horizontal 360° surround view visualization.

Due to noise or various white balance configurations, a noticeable seam may appear where two images are stitched together. Although various mitigation measures may be used to smooth out the transition of image pixel values from one image to another (e.g., assigning each pixel a weight proportional to its distance to the edge, multiresolution based blending, neural network based blending), a noticeable seam is often still visible in a stitched image. Some conventional techniques attempt to avoid placing seams on top of objects detected using ultrasonic sensors. However, ultrasonic sensors typically operate over a very short range. As a result, conventional techniques only consider very close objects when placing seams, effectively ignoring potentially important objects outside the ultrasonic sensing range, or prioritizing less important objects within the ultrasonic sensing range, and placing seams over regions of a stitched image that are potentially important for a driver to safely maneuver a vehicle.

Unfortunately, the process of stitching images may introduce a variety of artifacts, including geometric distortions (e.g., misalignments), texture distortions (e.g., blur, ghosting, object disappearance, object distortions), and color distortions. Distortion in a stitched image may be caused by a variety of issues, including the parallax effect, lens distortion, moving objects, and differences in exposure or illumination. For example, in some cases, capturing multiple images of a moving object using multiple cameras may in effect capture images of the object from different perspectives, such that different images may capture the object with different orientations or poses. Since the different representations of the object may not perfectly align when stitching images, an overlapping region may effectively combine images of the object and the background, creating a ghost-like effect in which the object may appear in two locations. Some objects may even disappear from the stitched image. Since these stitching artifacts may obscure useful visual information and can be distracting to the driver, stitching artifacts can interfere with the safe operation of the vehicle. As a result, there is a need for improved stitching techniques that reduce stitching artifacts, better represent useful visual information in a stitched image, and/or otherwise improve the visual quality of stitched images.

Moreover, in some existing Surround View Systems, two-dimensional (2D) images are used to approximate a three-dimensional (3D) visual representation of the environment surrounding the vehicle. For a given fisheye image, for example, each pixel captures a ray emitted from a surrounding 3D point, projected through the center of the fisheye camera and imaged onto the camera sensor, which records the intensity, color, and orientation of the 3D point. However, the distance between the point and the camera center is lost in the projection process. The 3D point may be anywhere along the ray defined by the camera center and the point on the sensor where the ray lands. Reconstructing this distance, for example using LiDAR data or Structure from Motion (SfM), is computationally expensive and often impractical. As a result, some Surround View Systems model the geometry of the environment surrounding the vehicle as a statically shaped and sized 3D bowl comprising a circular ground plane for the inner portion of the bowl connected to an outer bowl represented as a curved surface rising from the ground plane to a height or with a slope that increases proportionally to the distance from the bowl center. As such, some conventional systems project images onto this 3D bowl shape, render a view of the projected image data on the 3D bowl shape from the perspective of a virtual camera, and present the rendered view on a monitor visible to occupants (e.g., driver) of the vehicle.

However, conventional techniques that rely on this 3D bowl geometry have a variety of drawbacks. Generally, since this geometry is just a model, depth in the real world often does not match the assumed geometry, resulting in a variety of visual artifacts. For example, if a camera captures an object that is closer to the vehicle than the side of the bowl (e.g., inside the bowl), projecting the image of the object onto the side of the bowl creates a scale magnification since the assumed depth is incorrect. Captured objects that lie between the different regions of the bowl can also result in visualizations with significant distortions. In some cases, objects that are inside the bowl may be partially or completely lost during the projection process, leading to partial or complete object disappearance. If multiple cameras pick up an object that is farther away than the side of the bowl (e.g., outside the bowl), projecting these images onto the side of the bowl can lead to ghosting or duplication. Since these artifacts may obscure or omit useful visual information and are often distracting to the driver, the artifacts can interfere with the safe operation of the vehicle. As a result, there is a need for improved visualization techniques that reduce visual artifacts, better represent useful visual information in a stitched image, and/or otherwise improve the visual quality of stitched images.

Furthermore, existing Surround View Systems typically do not incorporate frames from cameras under the vehicle, due to the difficulty of keeping a camera lens clean or an insufficient field of view coverage. As a result, existing vehicle Surround View Systems can only visualize the front, left, rear, and right sides of a vehicle, leaving a significant blind spot: the area under the vehicle, which may be multiple square meters or more. Conventional techniques either do not include a visualization under the vehicle, or fill the area with artificial pixels, for example, using pure black to indicate that information is missing, or using a computer graphics 3D model of the vehicle. However, this under vehicle area is important for perception in many cases, such as: precisely maneuvering in a controlled manner in and out of narrow parking spaces or areas with high curbs, passing over speed bumps or potholes, avoiding objects on the road, driving off-road over rocks or rough terrain, and/or other scenarios. As such, conventional vehicle Surround View Systems are incapable of providing visual information that is beneficial for many driving maneuvers.

Finally, existing Surround View Systems are typically limited to capturing local sensor data and presenting a representation of the sensor data to occupants of the vehicle (e.g., the driver). As communication and network technology advances, there is a growing need to facilitate various remote experiences and functionality.

SUMMARY

Embodiments of the present disclosure relate to surround view or environment visualization, dynamic seam placement based on object saliency, dynamic seam placement based on ego-object state, an adaptive 3D bowl that models the surrounding environment with a shape that depends on distance and direction to detected objects, reconstruction of the area under an ego-object, and/or streaming a representation of an environment in and around an ego-object.

In contrast to conventional systems, such as those described above, dynamic seam placement may be used to position seams in regions of overlapping image data to avoid crossing salient objects or regions. Objects may be detected from image frames representing overlapping views of an environment surrounding an ego-object such as a vehicle. The images may be aligned to create an aligned composite image or surface (e.g., a panorama, a 360° image, bowl shaped surface) with overlapping regions of image data, and a representation of the detected objects and/or salient regions (e.g., a saliency mask) may be generated and projected onto the aligned composite image or surface. Seams may be positioned in the overlapping regions to avoid or minimize crossing salient pixels represented in the projected masks, and the image data may be blended at the seams to create a stitched image or surface (e.g., a stitched panorama, stitched 360° image, stitched textured surface).

In some embodiments, a state machine is used to select between a default seam placement and a dynamic seam placement that avoids salient regions, and to enable and disable dynamic seam placement based on speed of ego-motion, direction of ego-motion, proximity to salient objects, active viewport, driver gaze, and/or other factors. Images representing overlapping views of an environment may be aligned to create an aligned composite image or surface (e.g., a panorama, a 360° image, bowl shaped surface) with overlapping regions of image data, and a default or dynamic seam placement may be selected based on driving scenario (e.g., driving direction, speed, proximity to nearby objects). As such, seams may be positioned in the overlapping regions of image data, and the image data may be blended at the seams to create a stitched image or surface (e.g., a stitched panorama, stitched 360° image, stitched textured surface).

In some embodiments, an environment surrounding an ego-object is visualized using an adaptive 3D bowl that models the environment with a shape that changes based on distance (and direction) to detected objects. Distance (and direction) to detected objects may be determined using 3D object detection or a top-down 2D or 3D occupancy grid, and used to adapt the shape of the adaptive 3D bowl in various ways (e.g., by sizing its ground plane to fit within the distance to the closest detected object). The adaptive 3D bowl may be enabled or disabled during each time slice (e.g., based on ego-speed), and the 3D bowl for each time slice may be used to render a visualization of the environment (e.g., a top-down projection image, a textured 3D bowl, and/or a rendered view thereof).

In some embodiments, cached sensor data captured by an ego-object and ego-motion of the ego-object are used to reconstruct the area under the vehicle in real time. For example, image data captured over time by a vehicle may be cached into a composite map that visualizes the ground or drivable area, and the vehicle's ego-motion may be used to retrieve a region of the composite map corresponding to the under vehicle area. For each time slice, a newly captured or generated image representing that time slice may be used to generate a local map of an observed portion of the ground, and the local map may be merged with a composite map that represents previously observed local maps. Accordingly, the under vehicle area for that time slice may be reconstructed by retrieving corresponding pixels from the composite map using the vehicle's ego-motion.

In some embodiments, sensor data may be captured by sensors of an ego-object, such as a vehicle traveling in a physical environment, and a representation of the sensor data may be streamed from the ego-object to a remote location to facilitate various remote experiences, such as streaming to a remote viewer (e.g., a friend or relative), streaming to a remote or fleet operator, streaming to a mobile app configured to self-park or summon an ego-object, rendering a 3D augmented reality (AR) or virtual reality (VR) representation of the physical environment, and/or others. In some embodiments, the stream includes one or more command channels used to control data collection, rendering, stream content, or even vehicle maneuvers, such as during an emergency, self-park, or summon scenario.

BRIEF DESCRIPTION OF THE DRAWINGS

The present systems and methods for surround view or environment visualization are described in detail below with reference to the attached drawing figures, wherein:

FIG. 1 is a diagram illustrating an example data flow through an example Surround View System, in accordance with some embodiments of the present disclosure;

FIGS. 2A and 2B illustrate an example of image stitching, in accordance with some embodiments of the present disclosure;

FIG. 3 is a diagram illustrating an example data flow through an example image stitching system, in accordance with some embodiments of the present disclosure;

Dynamic Seam Placement and State-Dependent Image Stitching

FIG. 4A is an example surround view visualization with a seam placed vertically to avoid cutting surrounding cars, FIG. 4B is an example surround view visualization with a seam extending vertically along the left surface of a vehicle, FIG. 4C is an example surround view visualization with a seam extending horizontally from the side surfaces of a vehicle to avoid artifacts in front of the vehicle, FIG. 4D is an example surround view visualization updated with a backwards facing viewport based on a vehicle traveling in reverse, FIG. 4E illustrates example surround view visualizations with seams placed to avoid salient regions in various viewports, and FIG. 4F is an example viewport cost map, in accordance with some embodiments of the present disclosure;

FIG. 5 is an example dynamic seam stitching module, in accordance with some embodiments of the present disclosure;

FIG. 6 is an example dynamic seam placement state machine, in accordance with some embodiments of the present disclosure;

FIG. 7A is a diagram illustrating an example dynamic seam placement technique using object detection, mask projection, and seam steering, and FIG. 7B is an enlarged view of the overlapping mask projections shown in FIG. 7A, in accordance with some embodiments of the present disclosure;

FIGS. 8A and 8B are example images and surround view visualizations with and without dynamic seam placement, in accordance with some embodiments of the present disclosure;

FIG. 9 shows example surround view visualizations with and without dynamic seam placement, in accordance with some embodiments of the present disclosure;

FIG. 10 is a flow diagram showing a method for determining a position for a seam using projected masks, in accordance with some embodiments of the present disclosure;

FIG. 11 is a flow diagram showing a method for determining a position for a seam based at least on projected overlapping representations of one or more salient regions, in accordance with some embodiments of the present disclosure;

FIG. 12 is a flow diagram showing a method for determining whether to use a default placement or a dynamic placement for a seam, in accordance with some embodiments of the present disclosure;

Adaptive 3D Bowl Model of the Surrounding Environment

FIG. 13 illustrates an example data flow through an example surround view visualization system that uses a 3D bowl, in accordance with some embodiments of the present disclosure;

FIG. 14 shows example surround view visualizations of textured circular and elliptical 3D bowls, in accordance with some embodiments of the present disclosure;

FIG. 15 shows some example artifacts that can arise in surround view visualizations that use a 3D bowl;

FIG. 16 illustrates an example adaptive 3D bowl generator that adapts a 3D bowl based on distance to detected objects, in accordance with some embodiments of the present disclosure;

FIG. 17A is an example 2D top-down occupancy grid, and FIG. 17B illustrates an example adaptive 3D bowl generation technique that adapts a 3D bowl using a 2D top-down occupancy grid, in accordance with some embodiments of the present disclosure;

FIG. 18 is an example adaptive 3D bowl state machine, in accordance with some embodiments of the present disclosure;

FIG. 19 is a diagram illustrating an example data flow through a Surround View System that includes adaptive 3D bowl and dynamic seam placement state machines, in accordance with some embodiments of the present disclosure;

FIG. 20 shows example surround view visualizations with and without an adaptive 3D bowl, in accordance with some embodiments of the present disclosure;

FIG. 21 is a flow diagram showing a method for generating a surround view visualization based at least on an adaptive 3D bowl, in accordance with some embodiments of the present disclosure;

FIG. 22 is a flow diagram showing a method for generating a surround view visualization based at least on an adaptive 3D bowl with a shape that depends on distance to the closest detected object, in accordance with some embodiments of the present disclosure;

Under Vehicle Reconstruction

FIG. 23 is a diagram illustrating an example data flow through an under vehicle reconstruction system, in accordance with some embodiments of the present disclosure;

FIG. 24 is an example surround view visualization with a transparent vehicle rendered over an under vehicle reconstruction on a textured 3D surface, in accordance with some embodiments of the present disclosure;

FIG. 25 shows example vehicle rig and camera coordinate systems, in accordance with some embodiments of the present disclosure;

FIG. 26 is a cross-section view of an example 3D bowl, in accordance with some embodiments of the present disclosure;

FIG. 27 illustrates an example technique for updating a composite map that visualizes a drivable area, in accordance with some embodiments of the present disclosure;

FIG. 28 shows an example under vehicle reconstruction using simulated fisheye images, in accordance with some embodiments of the present disclosure;

FIG. 29 shows an example under vehicle reconstruction using real fisheye images, in accordance with some embodiments of the present disclosure;

FIG. 30 is a flow diagram showing a method for virtually reconstructing an area of a ground plane under an ego-object, in accordance with some embodiments of the present disclosure;

FIG. 31 is a flow diagram showing a method for merging a local map into a representation of a ground plane for virtually reconstructing an area of the ground plane under an ego-object, in accordance with some embodiments of the present disclosure;

Optimized Visualization Streaming

FIG. 32 is an example representation streaming system, in accordance with some embodiments of the present disclosure;

FIG. 33 is an example augmented reality and/or virtual reality simulator system, in accordance with some embodiments of the present disclosure;

FIG. 34 is a flow diagram showing a method for streaming representations of a physical environment, in accordance with some embodiments of the present disclosure;

Autonomous Vehicle

FIG. 35A is an illustration of an example autonomous vehicle, in accordance with some embodiments of the present disclosure;

FIG. 35B is an example of camera locations and fields of view for the example autonomous vehicle of FIG. 35A, in accordance with some embodiments of the present disclosure;

FIG. 35C is a block diagram of an example system architecture for the example autonomous vehicle of FIG. 35A, in accordance with some embodiments of the present disclosure;

FIG. 35D is a system diagram for communication between cloud-based server(s) and the example autonomous vehicle of FIG. 35A, in accordance with some embodiments of the present disclosure;

FIG. 36 is a block diagram of an example computing device suitable for use in implementing some embodiments of the present disclosure; and

FIG. 37 is a block diagram of an example data center suitable for use in implementing some embodiments of the present disclosure.

DETAILED DESCRIPTION

Systems and methods are disclosed related to surround view or environment visualization, dynamic seam placement based on object saliency, dynamic seam placement based on ego-object state, an adaptive 3D bowl that models the surrounding environment with a shape that depends on distance and direction to detected objects, reconstruction of the area under an ego-object, and/or streaming a representation of an environment in and around an ego-object. Although the present disclosure may be described with respect to an example autonomous vehicle 3500 (alternatively referred to herein as “vehicle 3500” or “ego-vehicle 3500,” an example of which is described with respect to FIGS. 35A-35D), this is not intended to be limiting. For example, the systems and methods described herein may be used by, without limitation, non-autonomous vehicles, semi-autonomous vehicles (e.g., in one or more advanced driver assistance systems (ADAS)), piloted and un-piloted robots or robotic platforms, warehouse vehicles, off-road vehicles, vehicles coupled to one or more trailers, flying vessels, boats, shuttles, emergency response vehicles, motorcycles, electric or motorized bicycles, aircraft, construction vehicles, underwater craft, drones, and/or other vehicle types. In addition, although the present disclosure may be described with respect to surround view visualizations for vehicles, this is not intended to be limiting, and the systems and methods described herein may be used to provide one or more multi-view, composite view, and/or proximity view representations in augmented reality, virtual reality, mixed reality, robotics, camera probes (e.g., medical or surgical probes), security and surveillance, autonomous or semi-autonomous machine applications, and/or any other technology spaces where image stitching, environment visualization, and/or representation streaming may be used.

Dynamic Seam Placement and State-Dependent Stitching

Systems and methods relating to dynamic seam placement and state-dependent image stitching are disclosed. For example, systems and methods are disclosed that use dynamic seam placement to avoid salient objects, and state-dependent image stitching based on speed of ego-motion, direction of ego-motion, proximity to salient objects, active viewport, driver gaze, and/or other factors. The present techniques may be utilized to visualize an environment around an ego-object, such as a vehicle, robot, and/or other type of object, in systems such as parking visualization systems, Surround View Systems, and/or others.

At a high level, objects (e.g., salient objects) may be detected from sensor data (e.g., fisheye images) representing an environment surrounding an ego-object such as a vehicle. Image data representing the environment may be aligned to create an aligned composite image or surface (e.g., a panorama, a 360° image, bowl shaped surface) with regions of overlapping image data, and a representation of the detected objects (e.g., a saliency mask) may be projected onto the aligned composite image or surface. Seams may be positioned in the overlapping regions to avoid crossing salient objects, and the image data may be blended at the seams to create a stitched image or surface (e.g., a stitched panorama, stitched 360° image, stitched textured surface). In some embodiments, a state machine implementing a decision tree is used to determine whether to use a default seam placement or dynamic seam placement that avoids salient objects, and to enable and disable dynamic seam placement based on speed of ego-motion, direction of ego-motion, proximity to salient objects, active viewport, driver gaze, and/or other factors.

Object detection may be performed using 2D object detection (e.g., from images) or 3D object detection (e.g., from images, a 3D point cloud of LiDAR or RADAR detections). In some embodiments, salient objects may be identified from detected objects, for example, based on object class (e.g., emergency vehicles, road users, moving objects), driving scenario (e.g., surrounding vehicles when parking between the vehicles, any close object when parallel parking), distance to the vehicle, and/or other factors. Depending on the implementation, objects deemed to be non-salient may be ignored (filtered out), and/or salient objects may be assigned a higher priority (or weight) than non-salient objects. For example, object detection may generate a binary object mask representing whether or not each point (e.g., each pixel in a 2D mask corresponding to each image) corresponds to a depicted part of a detected object, and the binary object mask may be used to derive a weighted saliency mask representing a measure of saliency of each point (e.g., by weighting the binary object mask based on proximity of a corresponding detected object, prioritizing certain classifications of detected object, using logic that depends on driving scenario). In some embodiments, a machine-learning model may be trained to predict a weighted saliency mask using training data labeled with a measure of ground truth saliency (e.g., objects or regions deemed to be important by human labelers). In some cases, driver gaze may be monitored (e.g., using a camera of a driver monitoring system), the region of the environment where the driver is looking may be projected into a corresponding region of the saliency mask, and the corresponding region may be updated to represent a measure of saliency. In some cases, a salient portion of a designated viewport (e.g., pixels toward the center of the field of view of the designated viewport) may be projected into a corresponding region of the saliency mask, and the corresponding region may be updated to represent a measure of saliency to emphasize placing a seam toward the edge or boundary of the field of view of the designated viewport. These are just a few examples, and other techniques that quantify saliency are contemplated within the scope of the present disclosure.
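
As a non-limiting illustration of the mask weighting described above, the following sketch derives a weighted saliency mask from a per-pixel binary object mask, a class map, and an estimated distance map. The class weights, the linear distance falloff, and all names are illustrative assumptions rather than a prescribed implementation.

    import numpy as np

    # Hypothetical class weights: classes treated as more salient receive larger values.
    CLASS_WEIGHTS = {"pedestrian": 1.0, "vehicle": 0.8, "curb": 0.5, "vegetation": 0.1}

    def weighted_saliency_mask(object_mask, class_map, distance_map, max_distance=15.0):
        """Derive a per-pixel weighted saliency mask from a binary object mask.

        object_mask  -- HxW bool array, True where a detected object is depicted
        class_map    -- HxW array of class labels for detected pixels
        distance_map -- HxW array of estimated distances (meters) to the ego-vehicle
        max_distance -- detections beyond this range contribute zero saliency
        """
        saliency = np.zeros(object_mask.shape, dtype=np.float32)
        # Closer objects receive higher weight; the weight decays linearly with distance.
        proximity = np.clip(1.0 - distance_map / max_distance, 0.0, 1.0)
        for cls, cls_weight in CLASS_WEIGHTS.items():
            cls_pixels = object_mask & (class_map == cls)
            saliency[cls_pixels] = cls_weight * proximity[cls_pixels]
        return saliency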

In some embodiments, to identify detected objects and/or salient regions, object detection may be performed on multiple images of an environment to create multiple object and/or saliency masks. The images may be aligned to create a composite image or surface (e.g., a panorama, 360° image, or bowl shaped surface) with regions of overlapping image data, and the object and/or saliency masks may be projected onto the composite image or surface. In some cases (e.g., for some driving scenarios like parking, based on a threshold speed of ego-motion), objects determined to be beyond some threshold distance from the vehicle (e.g., determined using corresponding LiDAR or RADAR detections) may be discarded from the masks. As such, any given overlapping region of image data may be overlaid with two (or more) different object and/or saliency masks representing objects or other salient regions detected by multiple sensors.

Depending on the implementation, different techniques may be applied to determine where to place a seam within a region of overlapping image data. In some cases, a seam may be steered through each overlapping region to avoid or otherwise minimize cutting through or intersecting pixels of a salient region detected by multiple sensors (e.g., as represented in corresponding masks), for example, using an energy function that penalizes seams that cover salient pixels (e.g., weighted by saliency). Taking a straight seam and a substantially rectangular overlapping region as an example, one end of a seam may be placed at one end point in the region, such as a corner representing the closest portion of the environment to the vehicle, and a straight line may be drawn from that end point to another end point at an opposite portion of the region. For example, starting at the bottom right corner of the overlapping region, a ray may be projected out at an angle theta, a seam space may be scanned over different values of theta to cover the region (e.g., from 0-90°), and a corresponding path may be traversed through the pixels of the region and evaluated.

In some embodiments, scanning a seam space of candidate seams and evaluating candidate seams serves to identify a seam (e.g., whether straight or some other shape) that avoids salient regions represented in both overlaid masks. Additionally or alternatively (e.g., if no such candidate seam exists), candidate seams may be evaluated to identify a seam that minimizes intersected salient pixels, that minimizes an energy function (e.g., having a cost that penalizes candidate seams that intersect or cover salient pixels, is weighted by saliency or proximity, is based on multiple images from different time slices to promote temporal stability), and/or otherwise. If there are multiple seams that would satisfy the designated criteria (e.g., that have the same cost or number of intersected salient pixels), temporal stability may be used to choose a seam location that minimizes the change in seam position compared to the previous frame. In some embodiments, costs for different seam shapes may be evaluated using a connected components analysis and/or using dynamic programming (e.g., based on designated permissible directions in which a seam may traverse to adjacent diagonal or edge pixels). In some embodiments, edge detection is performed (e.g., on image data in the overlapping region), and graph cuts and/or seam carving may be applied to carve out high frequency content (e.g., detected edges) of the overlapping region from potential solutions, such that candidate seams cannot travel through the graph cut. As a result, high frequency content like the edges of vehicles may be effectively designated as salient to discourage placing seams over detected edges, which may represent visually important information.
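
To make the seam-space scan concrete, the following minimal sketch rasterizes straight candidate seams anchored at the bottom-right corner of a rectangular overlapping region, scores each candidate by the saliency it crosses, and adds a small penalty for deviating from the previous frame's seam to promote temporal stability. The region shape, the cost terms, and the names are illustrative assumptions.

    import numpy as np

    def candidate_seam_pixels(height, width, theta_deg):
        """Rasterize a straight seam from the bottom-right corner at angle theta."""
        theta = np.radians(theta_deg)
        length = int(np.hypot(height, width))
        rows = np.clip((height - 1 - np.arange(length) * np.sin(theta)).astype(int), 0, height - 1)
        cols = np.clip((width - 1 - np.arange(length) * np.cos(theta)).astype(int), 0, width - 1)
        return rows, cols

    def place_seam(saliency_cost, prev_theta=None, stability_weight=0.01):
        """Scan candidate angles (0-90 degrees) and return the lowest-cost seam angle.

        saliency_cost -- HxW array combining the overlaid saliency masks for the region
        prev_theta    -- seam angle chosen for the previous frame, if any
        """
        height, width = saliency_cost.shape
        best_theta, best_cost = None, np.inf
        for theta in range(0, 91):
            rows, cols = candidate_seam_pixels(height, width, theta)
            cost = saliency_cost[rows, cols].sum()
            if prev_theta is not None:
                # Penalize large jumps from the previous seam position.
                cost += stability_weight * abs(theta - prev_theta)
            if cost < best_cost:
                best_theta, best_cost = theta, cost
        return best_theta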

As such, seams may be placed in overlapping regions of aligned images to avoid or minimize crossing salient objects or regions, and image data from the aligned images may be blended at the seams to create a stitched image, such as a 360° (“surround view”) visualization of the environment surrounding the vehicle, and a view of the surround view visualization may be rendered. The surround view visualization of the environment may take the form of a stitched panorama, a stitched 360° image, a top-down projection of a stitched 360° image, a textured 3D geometric surface modeling the surrounding environment in the shape of a 3D bowl, a rendering of one of the foregoing, and/or other forms. For example, stitched image data may be mapped onto a textured 3D surface in a 3D representation of the environment; a virtual camera may be placed in the 3D environment with a specified location and/or orientation and used to render a view of the textured 3D surface from the perspective of the virtual camera into a viewport; and/or the rendered view may be presented on a monitor visible to occupants (e.g., driver) of the vehicle.

In some embodiments, a state machine implementing a decision tree is used to determine whether to use a default seam placement or dynamic seam placement that avoids salient objects or regions, and to enable and disable dynamic seam placement based on speed of ego-motion, direction of ego-motion, proximity to salient objects, active viewport direction, driver gaze, and/or other factors. For example, taking a driving scenario when the vehicle is traveling forward, or when the viewport of the virtual camera is facing forward, when the speed of the vehicle is below a lower speed threshold (e.g., less than 5 km/hr) and the closest detected object is closer to the vehicle than some threshold proximity (e.g., less than 3 m), dynamic seam placement may be enabled. By contrast, when the speed of the vehicle is below a lower speed threshold and there are no detected objects within the threshold proximity, or when the speed of the vehicle is within some medium speed range (e.g., from 5-16 km/hr), dynamic seam placement may be disabled (e.g., in favor of a default seam, such as a horizontal seam). Taking a driving scenario when the vehicle is traveling in some other direction (e.g., backwards), or when the viewport of the virtual camera is facing in some other direction (e.g., backwards), when the speed of the vehicle is below a lower speed threshold (e.g., less than 5 km/hr), dynamic seam placement may be enabled. In contrast, when the speed of the vehicle is within some medium speed range (e.g., from 5-16 km/hr), a seam may be gradually moved by using a previous frame's seam when it avoids detected (salient) objects from a current time slice, and may otherwise be updated using dynamic seam placement.
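
The forward- and backward-facing scenarios described above can be organized into a small decision function, sketched below. The 5 km/hr, 16 km/hr, and 3 m thresholds come from the examples in this paragraph; the function, state names, and the fallback for speeds outside the listed ranges are illustrative assumptions.

    # Example thresholds from the driving scenarios described above.
    LOW_SPEED_KMH, MED_SPEED_KMH, PROXIMITY_M = 5.0, 16.0, 3.0

    def select_seam_mode(speed_kmh, closest_object_m, viewport_forward, prev_seam_ok):
        """Return 'dynamic', 'default', or 'previous' seam placement for the current frame.

        speed_kmh        -- current ego speed
        closest_object_m -- distance to the closest detected object (None if no detections)
        viewport_forward -- True when driving forward or the active viewport faces forward
        prev_seam_ok     -- True when the previous frame's seam avoids current salient objects
        """
        if viewport_forward:
            if (speed_kmh < LOW_SPEED_KMH and closest_object_m is not None
                    and closest_object_m < PROXIMITY_M):
                return "dynamic"
            return "default"  # no nearby objects, or medium/higher speeds
        # Backwards (or other direction) scenario.
        if speed_kmh < LOW_SPEED_KMH:
            return "dynamic"
        if speed_kmh < MED_SPEED_KMH:
            # Gradually move the seam: reuse the previous seam when it still avoids
            # salient objects, otherwise fall back to dynamic placement.
            return "previous" if prev_seam_ok else "dynamic"
        return "default"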

As such, the techniques described herein may be used to dynamically place seams to avoid or minimize crossing salient objects or regions, such as emergency vehicles, road users, and moving objects. Furthermore, some embodiments apply state-dependent image stitching based on speed of ego-motion, direction of ego-motion, proximity to salient objects, active viewport direction, driver gaze, and/or other factors. Dynamically moving seams to avoid or minimize crossing salient objects or regions can reduce various geometric and/or texture artifacts such as object disappearance, ghosting, and discontinuities in salient regions of a stitched image. Furthermore, applying state-dependent image stitching serves to steer seams away from salient regions, depending on the driving scenario (e.g., based on the state of the ego-vehicle), effectively avoiding placement of seams over regions of a stitched image that are potentially important for a driver to safely maneuver a vehicle. As such, the techniques described herein may be used to reduce stitching artifacts, improve visual representations of useful information in a stitched image, improve visual quality of stitched images, and promote safe operation of the vehicle.

Adaptive 3D Bowl Model of the Surrounding Environment

Systems and methods relating to an adaptive 3D bowl geometry are disclosed. For example, systems and methods are disclosed that adapt the shape, orientation, and/or dimensions of a 3D bowl (e.g., a mesh) modeling the surrounding environment based on the distance to nearby detected objects and project image data onto the adapted 3D bowl. The present techniques may be utilized to visualize an environment around an ego-object, such as a vehicle, robot, and/or other type of object, in systems such as parking visualization systems, Surround View Systems, and/or others.

In an example embodiment, at a high level, 3D object detection may be performed to detect objects from sensor data (e.g., images, LiDAR data, RADAR data) representing an environment surrounding an ego-object such as a vehicle. In some embodiments, sensor data such as a LiDAR point cloud is projected onto a top-down 2D occupancy grid that represents locations of detected objects or a 3D occupancy grid that represents locations and projected or assumed heights of detected objects (e.g., assuming vehicles have a height of 2 or 3 meters above the ground surface). The 3D object detections and/or the 2D/3D occupancy grid may be used to compute distances to detected objects, and the distances may be populated in a radial distance map that represents distance to (e.g., a representative point(s) on) the closest detected object as a function of angle (e.g., representing a rotation around an axis of the vehicle coordinate system, such as yaw). The distances and/or the radial distance map may be used to adapt the shape of a 3D bowl (e.g., a mesh or other geometric model), for example, by (re)sizing the ground plane of the 3D bowl to fit within the distance to the closest detected object, or modifying the shape, size (e.g., along one or more dimensions or axes), or orientation (e.g., rotation relative to an ego-vehicle) of the 3D bowl to place (e.g., one or more points of) the closest detected object(s) between the ground plane of the bowl and the outer edge or rim of the bowl. In some embodiments, a state machine implementing a decision tree is used to determine whether to use a fixed 3D bowl or an adaptive 3D bowl that adapts its shape based on distance to detected object(s). As such, image data (e.g., four fisheye images) may be projected onto the applicable 3D bowl, and a view of the resulting textured 3D surface may be rendered and presented on a monitor or other display device visible to occupants (e.g., driver) of the vehicle.

Generally, distances to surrounding objects may be obtained or determined in various ways. In some embodiments, 3D object detection is performed (e.g., by processing sensor data) and/or a representation of detected 3D objects (e.g., 3D cuboids in rig coordinates) is accessed. For example, 2D bounding boxes and/or 3D cuboids may be predicted from image data (e.g., fisheye images), LiDAR or RADAR detections, and/or other sensor data using one or more machine learning models. As such, distances between the vehicle (e.g., the vehicle center) and one or more representative point(s) of the detected objects (e.g., a closest point that belongs to each detected object, a corner or point on an edge of a 3D bounding box of each detected object, the average location of the N closest corners or edges of a 3D bounding box of each detected object) may be computed from predicted and/or sensed locations. Additionally or alternatively, sensor data such as a LiDAR or RADAR point cloud representing detected objects may be projected into a 2D top-down occupancy grid or 3D top-down occupancy grid (e.g., a height map storing projected or assumed heights), and 2D or 3D distances to detected objects may be computed from the 2D or 3D occupancy grid. Generally, any suitable ranging and/or depth/distance estimation technique may be used to determine distances to surrounding objects, such as computer vision and/or neural network techniques (e.g., 3D object detection, multi-view stereo, structure from motion, dense tracking and mapping (DTAM), monocular depth estimation), measurements from depth sensor(s) (e.g., ultrasonic sensor(s), LiDAR sensor(s), RADAR sensor(s)), and/or otherwise.

In some embodiments, since different techniques may have different benefits, multiple techniques and/or sensors may be used (e.g., 3D object detection, ultrasonic depth) to generate different types of distance or depth estimations, and the different distance/depth estimations may be combined (e.g., projected into a common representation such as a depth map, height map, or radial distance map; taking a union or intersection of different types of estimations; resolving conflicts by prioritizing certain techniques or types of sensors; choosing or weighting different estimations based on corresponding confidence values, etc.).

In some embodiments, detected objects beyond some threshold distance to the vehicle (e.g., 3 m, 7 m, 15 m, etc.) and/or detected objects outside a field of view of an active (displayed) viewport may be filtered out or ignored (e.g., only considering detected objects within a threshold distance, only considering objects that are in the field of view of the displayed viewport) to free up computational resources and/or reduce distortion resulting from a wider range of detected objects.

As such, distances to detected objects may be computed, and the distances may be populated in a radial distance map that represents distance to (e.g., a representative point(s) on) the closest detected object (e.g., as a function of (e.g., yaw) angle, in a list of entries representing distance and direction from ego-center to each corner of a detected object, or in some other structure or form).
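
As one possible way to populate the radial distance map described above, the sketch below stores, for each yaw-angle bin, the distance to the closest detected point in that direction. It assumes detected objects are available as bounding-box corner positions in ego/rig coordinates; the bin count, default range, and names are illustrative.

    import numpy as np

    def radial_distance_map(object_corners, num_bins=360, max_range=15.0):
        """Build a map of distance to the closest detected object as a function of yaw angle.

        object_corners -- iterable of (x, y) corner positions in ego coordinates (meters)
        num_bins       -- angular resolution of the map (bins over 360 degrees)
        max_range      -- default distance used for directions with no detections
        """
        distances = np.full(num_bins, max_range, dtype=np.float32)
        for x, y in object_corners:
            r = np.hypot(x, y)
            yaw = np.degrees(np.arctan2(y, x)) % 360.0
            bin_idx = int(yaw * num_bins / 360.0) % num_bins
            distances[bin_idx] = min(distances[bin_idx], r)
        return distances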

Generally, the distances and directions to the surrounding objects may be used to adapt various shapes of a 3D bowl (e.g., a mesh) that models the geometry of the environment surrounding the vehicle. In terms of bowl shape, the 3D bowl mesh when viewed from top-down may be circular, elliptical, or some other shape (whether regular or irregular), and the shape may be symmetrical or asymmetrical. In some embodiments, a symmetrical shape such as an ellipse may be used, and an axis of the ellipse (e.g., the short or long axis) may be aligned with an axis of the vehicle coordinate system (e.g., pointing to the front of the vehicle). Depending on the embodiment, distances to surrounding objects may be used to fit a symmetrical bowl shape, synthesize a shape using local deformations to accommodate different distances to different objects, or otherwise.

Taking an example symmetrical bowl shape such as an ellipse, a 3D bowl may be represented as a combination of an elliptical ground plane (an inner bowl) and an elliptical side rising from the ground plane to a peak height at the bowl edge or rim (an outer bowl). The inner and outer bowl may be separated by some distance, whether constant (e.g., 3 m) or variable, and the inner bowl may be sized to fit within the distance to the closest detected object. For example, a representative point on a closest detected object may be assumed to be at some point along the circumference of an ellipse that is aligned with and centered on the vehicle, and the ellipse may be parameterized using the standard equation for an ellipse to calculate representative parameters for the inner bowl (e.g., values of the short and long axes, foci, eccentricity, circumference, etc.). Additionally or alternatively, the representative points of multiple (e.g., filtered, clustered) detected objects may be used to fit an ellipse (or multiple ellipses), or some other shape, optionally rotated (e.g., to align a long or short axis with a particular detected object). In some embodiments, an optimization algorithm may be used to fit an ellipse (or some other shape), for example, by iterating over candidate radii and minimizing a cost function that evaluates the candidate radii based on the sum of the distances between the circumference of the candidate ellipse and the representative point(s) for each closest detected object (e.g., prioritizing/discarding particular classes of detected object (e.g., selecting or weighting vehicles over vegetation), prioritizing/discarding particular instances of detected objects (e.g., placing higher weights on closer objects)). These are just a few examples, and other ways of adapting the inner and/or outer bowl based on distance to a detected object may be implemented. In some embodiments, the 3D bowl is parameterized by parameters of the inner bowl and the outer bowl (e.g., short and long axes of the inner and outer bowls).
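
As one concrete way to apply the standard ellipse equation mentioned above, the sketch below sizes an ego-centered, axis-aligned elliptical inner bowl so that the closest detected point lies on its circumference. The fixed aspect ratio and the names are illustrative assumptions; fitting to multiple points or rotating the ellipse, as also described above, would require a different parameterization.

    import numpy as np

    def fit_inner_bowl_axes(closest_point_xy, aspect_ratio=1.5, min_axis=2.0):
        """Size an ego-centered, axis-aligned elliptical inner bowl so that the closest
        detected point lies on its circumference.

        closest_point_xy -- (x, y) of the closest detected object in ego coordinates,
                            with x pointing to the front of the vehicle
        aspect_ratio     -- assumed ratio of the long (front) axis to the short (side) axis
        min_axis         -- lower bound on the short semi-axis (meters)
        """
        x, y = closest_point_xy
        # Standard ellipse equation (x/a)^2 + (y/b)^2 = 1 with a = aspect_ratio * b;
        # solve for the short semi-axis b that places the point on the circumference.
        b = np.sqrt((x / aspect_ratio) ** 2 + y ** 2)
        b = max(b, min_axis)
        return aspect_ratio * b, b  # (long semi-axis a, short semi-axis b)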

In some embodiments, the 3D bowl may be constructed using the distance to each point radially surrounding the vehicle. For example, the distance to the closest detected object in each radial direction may be determined, the resulting radial values may be used as the shape (or may be used to fit a shape) for the inner bowl, and the outer bowl may be set to some fixed separation from the inner bowl (e.g., 3 m). To smooth out the bowl shape, a spatial filter may be applied over some angular range (e.g., 20-60°).

Generally, any number of spatial and/or temporal filters may be applied to parameters (or a combination of parameters) representing the 3D bowl (e.g., short and long axes of an elliptical bowl). A speed filter may be applied to enable or disable the adaptive bowl based on speed of the ego-vehicle (e.g., disabled at speeds above some threshold such as 16 km/hr, and enabled at lower speeds). In some embodiments (e.g., that use an elliptical bowl), a temporal filter may be applied over a temporal window (e.g., 30 frames of data buffered first-in-first-out) to a ratio of parameters (e.g., the ratio of the short to long axes) to stabilize the ratio, to individual parameters (e.g., short and long axes of an inner bowl), and/or otherwise. In some embodiments, temporal filtering is applied to a ratio before being applied to individual parameters in order to apply coarse filtering before finer smoothing. In some implementations, stochastic and/or Kalman filtering may be applied over a temporal window to smooth out transitions. These are just a few examples, and other filters may additionally or alternatively be applied.
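
A minimal sketch of the temporal filtering described above, assuming a 30-frame first-in-first-out buffer and an elliptical bowl parameterized by its short and long axes. The simple moving average and the names are illustrative; stochastic or Kalman filtering could be substituted.

    from collections import deque

    class BowlRatioFilter:
        """Smooth the short/long axis ratio of an elliptical bowl over a temporal window."""

        def __init__(self, window=30):
            self.history = deque(maxlen=window)  # frames are discarded first-in-first-out

        def filter(self, short_axis, long_axis):
            """Return (short, long) axes with the short/long ratio temporally smoothed."""
            self.history.append(short_axis / long_axis)
            smoothed_ratio = sum(self.history) / len(self.history)
            # Keep the long axis and adjust the short axis to match the smoothed ratio.
            return smoothed_ratio * long_axis, long_axis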

In some embodiments, a state machine implementing a decision tree is used to determine whether to use a fixed 3D bowl or an adaptive 3D bowl that adapts its shape based on distance to detected object(s). In some scenarios, applying an adaptive 3D bowl to passing objects may create undesirable distortions or artifacts that can manifest as a blurred or “wobbling” effect. To avoid or reduce these distortions, an adaptive 3D bowl may be selectively enabled in scenarios less likely to involve passing objects, for example, based on speed of ego-motion. For example, at low speeds, a vehicle may be embarking on a journey (e.g., from a stopped and/or parked position), approaching or stopping at an intersection such as a road sign or traffic light, or performing maneuvers at low speeds. Since a vehicle embarking on a journey is less likely to encounter passing objects, in some embodiments, when the speed of the vehicle is below a lower speed threshold (e.g., less than 1 km/hr) and the vehicle is embarking on a journey, the adaptive 3D bowl may be enabled. When the speed of the vehicle is below a lower speed threshold (e.g., less than 1 km/hr) and the vehicle is not embarking on a journey, the adaptive 3D bowl may be disabled (e.g., locked to a fixed bowl or the previous bowl shape). When the speed of the vehicle is within some medium speed range (e.g., from 1-16 km/hr), the adaptive 3D bowl may be disabled. When the speed of the vehicle is above some medium speed range, the vehicle may be more likely to encounter passing objects, so the adaptive 3D bowl may be disabled (e.g., and locked to a fixed bowl or the previous bowl shape).
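
The speed-based scenarios in this paragraph might be expressed as a small enable/disable check, sketched below. The 1 km/hr threshold comes from the examples above; the notion of an "embarking on a journey" signal and the names are illustrative assumptions.

    LOW_SPEED_KMH = 1.0  # example lower speed threshold from the scenarios above

    def adaptive_bowl_enabled(speed_kmh, embarking_on_journey):
        """Decide whether to adapt the 3D bowl this time slice or keep a fixed/previous bowl."""
        if speed_kmh < LOW_SPEED_KMH:
            # Low speed: adapt only when the vehicle is starting out (e.g., leaving a parked
            # position), where passing objects are unlikely.
            return embarking_on_journey
        # Medium (e.g., 1-16 km/hr) and higher speeds: passing objects are more likely, so the
        # bowl is locked to a fixed shape or the previous shape to avoid "wobbling" artifacts.
        return False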

As such, image data (e.g., four fisheye images) may be projected onto the applicable 3D bowl, and a view of the projected image data may be rendered and presented on a monitor visible to occupants (e.g., driver) of the vehicle. Adapting the shape of a 3D bowl based on the distance to nearby detected objects serves to reduce scale magnification and gross distortion artifacts for what are often the most important objects, the closest ones to the vehicle. As such, the techniques described herein may be used to generate improved visualizations that reduce visual artifacts, better represent useful visual information in a surround view visualization, and promote safe operation of the vehicle.

Under Vehicle Reconstruction

Systems and methods relating to reconstruction of the area under a vehicle are disclosed. For example, systems and methods are disclosed that use cached sensor data captured by a vehicle and ego-motion of the vehicle to reconstruct the area under the vehicle in real time. The present techniques may be utilized to visualize an environment around an ego-object, such as a vehicle, robot, and/or other type of object, in systems such as parking visualization systems, Surround View Systems, and/or others.

At a high level, sensor data captured over time by an ego-object may be cached to generate a map (e.g., a plurality of local maps, a composite map) that visualizes a drivable area, and ego-motion of the vehicle may be used to retrieve sensor data from the map to virtually reconstruct the area under the vehicle. In an example implementation, one or more sensors such as a front or back fisheye camera on a vehicle are utilized to build a local textured map of the environment (e.g., the drivable area) through which the vehicle travels, and maps generated at different timestamps may be cached and/or combined into a composite map. The vehicle's location and orientation within the local and/or composite map may be estimated, for example, using the vehicle ego-motion and/or pose. As such, a portion of the local and/or composite map corresponding to the area under the vehicle may be retrieved, and a representation of the retrieved portion may be presented as a virtual reconstruction of the under vehicle area. In some embodiments, this reconstructed under vehicle area (under vehicle reconstruction, or UVR) may be stitched together with a surround view visualization to provide a new surround view visualization with greater visible area.

As such, even though there may not be an image of the under vehicle area captured at any given moment, that area may be reconstructed from sensor data captured at a previous time. In some embodiments, the sensor (e.g., camera) used to cache observed sensor data for future UVR usage depends on the direction of ego-motion. For example, if a vehicle is moving forward, image data from a front-facing (e.g., fisheye) camera may be used, and/or if a vehicle is moving backward, image data from a rear-facing (e.g., fisheye) camera may be used. In some embodiments, such as those involving longer ego-objects such as long trucks or vehicles with trailers making sharp turns, image data from side (e.g., fisheye) cameras may additionally or alternatively be used. In some embodiments, sensor data from multiple sensors (e.g., fisheye cameras) is stitched together to form a composite view (e.g., a surround view or 360° image), and the composite view is used for UVR. Generally, any number and/or type of sensor may be used to capture sensor data for under vehicle reconstruction. In a scenario where the vehicle is starting from a parked position where ego-motion has not yet started, previously cached (e.g., the last cached) sensor data or UVR may be utilized (e.g., stitched) with current sensor data.
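
A direction-dependent camera selection for caching, following the examples above, might look like the following sketch; the names are illustrative, and a real system may instead cache a stitched composite view from multiple cameras.

    def select_uvr_camera(ego_velocity_x, is_sharp_turn=False):
        """Pick which camera's frames to cache for under vehicle reconstruction.

        ego_velocity_x -- longitudinal velocity (positive when moving forward)
        is_sharp_turn  -- True for scenarios (e.g., long trucks, trailers) where side
                          cameras provide better coverage of the swept area
        """
        if is_sharp_turn:
            return "side"
        if ego_velocity_x > 0:
            return "front_fisheye"
        if ego_velocity_x < 0:
            return "rear_fisheye"
        return "cached"  # not yet moving: reuse previously cached data or the last UVR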

In an example implementation, sensor data is periodically captured to generate frames of sensor data (whether each frame is captured from an individual sensor or combined using sensor data captured from multiple sensors), where the frames represent different time slices (e.g., a video feed). For each time slice, the under vehicle area is retrieved from the local and/or composite map using ego-motion and/or pose of the vehicle. In some embodiments, in response to a new frame of sensor data being captured or generated, two tasks may be conducted. In one task, the sensor data (e.g., image data) from the new frame may be cached (e.g., in its entirety) and/or used to update a composite map. In another task, the under vehicle area at the moment represented by the time slice may be calculated and corresponding sensor data may be retrieved using ego-motion of the moment. In some embodiments, a representation of the under vehicle reconstruction is presented, for example, on a monitor visible to occupants (e.g., driver) of the vehicle. In one or more embodiments, the representation of the under vehicle reconstruction may be provided to a remote viewer (e.g., a remote operator), via, for example and without limitation, streaming over a wireless communication channel. In some embodiments, the under vehicle reconstruction is stitched together with the sensor data from which it was generated, a surround view, and/or a 360° visualization of the environment surrounding the vehicle, for example, forming a textured 3D geometric surface (e.g., in the shape of a bowl) that models the surrounding environment. A virtual camera may be specified and used to render a view of the 3D surface from the perspective of the virtual camera, and/or the rendered view may be presented on a monitor visible to the driver of the vehicle.
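
The two per-frame tasks described above might be organized roughly as in the sketch below, which assumes a top-down composite map stored as an image over world coordinates, a known meters-per-pixel scale, local patches that have already been warped into world orientation (with moving objects masked out), and, for simplicity, an axis-aligned vehicle footprint. All names and geometry are illustrative.

    import numpy as np

    class UnderVehicleReconstructor:
        """Cache top-down ground observations into a composite map and retrieve the
        area under the vehicle for each time slice using the vehicle's ego pose."""

        def __init__(self, map_size_px=2000, meters_per_px=0.02):
            self.map = np.zeros((map_size_px, map_size_px, 3), dtype=np.uint8)
            self.mpp = meters_per_px
            self.origin = map_size_px // 2  # world (0, 0) maps to the center of the map

        def _to_px(self, x_m, y_m):
            """Convert world coordinates (meters) to composite-map pixel indices."""
            return int(self.origin - y_m / self.mpp), int(self.origin + x_m / self.mpp)

        def update(self, local_map, patch_center_xy):
            """Task 1: merge a newly observed local top-down patch into the composite map."""
            h, w, _ = local_map.shape
            row, col = self._to_px(*patch_center_xy)
            self.map[row - h // 2: row - h // 2 + h, col - w // 2: col - w // 2 + w] = local_map

        def reconstruct(self, ego_xy, length_m=4.8, width_m=1.9):
            """Task 2: retrieve the composite-map pixels under the vehicle footprint,
            with the vehicle length assumed to lie along the world x axis."""
            row, col = self._to_px(*ego_xy)
            half_len_px = int(length_m / (2 * self.mpp))  # extent along x -> columns
            half_wid_px = int(width_m / (2 * self.mpp))   # extent along y -> rows
            return self.map[row - half_wid_px: row + half_wid_px,
                            col - half_len_px: col + half_len_px].copy()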

In some embodiments, one or more deep neural networks (DNNs) are used to detect or otherwise distinguish drivable space from moving objects, such as pedestrians, bicycles, and vehicles, represented in the captured sensor data and/or in the composite view of the environment, and sensor data representing moving objects is omitted from the local and/or composite map. This way, moving objects that are not part of the drivable space should not appear in the reconstructed area under the vehicle, and the driving surface under the vehicle may be visualized.

The present techniques may be used in a variety of visualizations and/or advanced driver assistance system (ADAS) features. For example, a “Transparent Hood” visualization may render a 3D model of the vehicle over the reconstructed area, but with the hood of the vehicle transparent, allowing a driver to view the relative positions of the vehicle's front wheels and the ground. In another example, a “Transparent Vehicle” visualization may render the entire 3D model of the vehicle transparently, allowing a driver to view the reconstructed under vehicle area through the transparent vehicle. In another example, a “Transparent Trailer” visualization may render a 3D model of the trailer (and/or a vehicle it is attached to) transparently. These are just a few examples, and other visualizations may additionally or alternatively be provided.

As such, the techniques described herein may be used to cache sensor data representing a navigable surface into a composite map, and reconstruct the area under an ego-object using the composite map and the ego-motion of the ego-object. Unlike conventional visualizations that omit or occlude the under vehicle area, the present techniques provide a real-time view of the surface under the vehicle, improving driver perception of the surrounding environment, and therefore improving the driver's ability to perform many tasks, such as passing over speed bumps, avoiding potholes, driving on narrow roads with high curbs or unpaved terrain, and/or other maneuvers.

Optimized Visualization Streaming

Systems and methods relating to streaming a representation of an environment in and around an ego-object are disclosed. For example, systems and methods are disclosed that stream, render, and/or otherwise deliver a representation of various types of sensor data to a remote location to facilitate various remote experiences. The present techniques may be utilized to visualize and/or stream a representation of an environment in and around an ego-object, such as a vehicle, robot, and/or other type of object, in systems such as parking visualization systems, Surround View Systems, and/or others.

At a high level, sensor data may be captured by sensors of an ego-object, such as a vehicle traveling in a physical environment, and a representation of the sensor data may be streamed from the ego-object to a remote location to facilitate various remote experiences, such as streaming to a remote viewer (e.g., a friend or relative), streaming to a remote or fleet operator, streaming to a mobile app configured to self-park or summon an ego-object, rendering a 3D augmented reality (AR) or virtual reality (VR) representation of the physical environment, and/or others. In some embodiments, the stream includes one or more command channels used to control data collection, rendering, stream content, or even vehicle maneuvers, such as during an emergency, self-park, or summon scenario.

Depending on the embodiment and/or the scenario, the streamed content may take a variety of forms. Generally, an object in an environment (e.g., an ego-object such as a vehicle, robot, or person; a stationary object such as a sign, pole, wall, or bridge) may be affixed with one or more sensors, such as cameras, microphones, ultrasonic sensors, RADAR sensors, LiDAR sensors, or infrared sensors, to name a few examples. In some embodiments involving a vehicle with a cabin, the vehicle may be equipped with external sensors capturing a representation of the outside environment and/or internal sensors capturing a representation of the environment inside the cabin. In some cases, raw sensor data may be streamed (e.g., video feed(s), LiDAR data, RADAR data, audio narration). Additionally or alternatively, some rendering or other representation of the raw sensor data may be streamed. Taking a surround view visualization as an example, sensor data such as image or LiDAR data may be projected onto a 3D representation (e.g., a 3D surface), and a virtual camera may be used to render a view of the 3D representation in a 2D viewport. In some embodiments, the stream may transport the 3D representation, the 2D viewport, and/or some segment or portion thereof. The position and orientation of the virtual camera may be controlled in various ways, such as by a remote command (e.g., enabling a remote person to control what perspective is streamed); based on occupant gaze, head pose, or body pose (e.g., detected using a Driver Monitoring System (DMS) and/or an Occupant Monitoring System (OMS)); based on a driving scenario (e.g., parking, direction and/or speed of ego-motion); and/or otherwise.

Taking directional or surround audio as another example, an object(e.g., an ego-object) such as a vehicle may be affixed with multiplemicrophones around the vehicle. In some embodiments, raw audio data maybe streamed to some remote location where directional or surround audiomay be computed. In some cases, directional or surround audio may becomputed at the vehicle and streamed. In either scenario, thedirectional or surround audio may be provided to and played back at aremote location, for example, to facilitate an immersive experience fora remote viewer or operator. In some embodiments, environmental noise(e.g., road noise) may be removed in various ways (e.g., using one ormore machine learning models to subtract noise from an audio signal). Insome embodiments, a remote operator may issue one or more commandssteering the directional or surround audio. Additionally oralternatively, the environment may be analyzed for saliency (e.g.,whether at the vehicle or at some remote location), and a direction ofsaliency may be used to emphasize audio from that direction. Forexample, an audio signal may be analyzed (e.g., by one or more machinelearning models) to detect an emergency noise or other salient eventsuch as a vehicle horn, siren, screeching tires, a collision, and/orothers, and detection of the emergency noise or other salient event froma microphone pointing in a particular direction may be used to steer thedirectional audio in that direction.

In some embodiments, other sensor data may additionally or alternativelybe used to detect an emergency or other salient event, and the detectedevent may be used to steer capturing or rendering of sensor data towardsthe direction of the detected salient event. For example, image data,video data, proximity data, LiDAR data, RADAR data, and/or other sensordata may be analyzed (e.g., by one or more machine learning models), anddetection of a salient event may be used to steer directional audioand/or a viewport rendering of a 3D representation of the environmenttowards the direction of the detected event. The directed viewportrendering may be presented on a monitor visible to occupants (e.g.,driver) of the vehicle, presented on a monitor at a remote location,presented in association with some alarm, presented picture-in-picture(e.g., with some other video feed such as one pointing in the directionof travel), and/or otherwise. In some embodiments, the detected salientevent (e.g., an imminent collision) may be used to trigger commandeeringcontrol of the vehicle, avoiding a collision, and/or steering thevehicle to safety.

Generally, a transport system that streams a representation of thephysical environment to a remote location may be triggered andcontrolled in various ways. In some embodiments, capture, rendering,and/or transport may be triggered by ego-object location (e.g., when avehicle reaches a particular intersection, latitude/longitude,geofence), triggered by a detected salient event (e.g., an accident,encountered emergency vehicle), triggered at a designated time,triggered on demand (e.g., by a vehicle occupant, remote operator),and/or otherwise. The transport system may include one or more commandchannels configured to deliver and trigger execution of remotely issuedcommands (e.g., via remote control functionality, such as that providedby NVIDIA Remote Control (NVRC) technology).

The transport system may use any streaming technology to transport therepresentation of the environment over a wireless communication channelto a remote location. For example, the transport system may includededicated communication channels for each type of content (e.g., one ormore types of sensor data, one or more types of rendered content,two-way audio, each type of command), may support scalable streams(e.g., using scalable audio or video coders to adjust encoding qualitybased on bandwidth), and/or may implement a Quality of Service (QoS)mechanism to assign a priority to certain streamed content and/orcommands and manage the stream accordingly. In some embodiments, thetransport system implements a stream hierarchy that prioritizesparticular types of content. For example, in some implementations thatprioritize a rendered AR/VR visualization, the rendered AR/VRvisualization may be prioritized for streaming and other sensor datasuch as LiDAR data may be deprioritized and dropped (e.g., to conservebandwidth). In another example, in some implementations that delivercontent to a mobile device, LiDAR data may be dropped because the mobiledevice may lack the functionality to handle that type of data. In somescenarios, raw sensor data may be streamed and all other content may bedropped. In yet another example, different modalities of sensor data mayhave different assigned priorities, and deprioritized sensor data may bedropped. These are just a few examples, and other hierarchies arecontemplated within the scope of the present disclosure.

Depending on the implementation, the remote location may include one ormore servers (e.g., a web or application server), one or more clientdevices (e.g., a mobile app on a mobile device), a data center ordistributed computing platform, and/or others. In some embodiments, oneor more servers are used to receive streamed content, generate arendering (e.g., a horizontal 360° surround view visualization, arendered AR/VR environment, a 2D viewport of a 3D environment), generatean adaptive 3D bowl on which to project a surround view visualization,and/or distribute streamed and/or generated content to one or morerecipient (e.g., client) devices.

As such, the streaming techniques described herein may enable a varietyof remote experiences. In an example remote user experience, a remoteuser operating a computing device may choose (and issue commandsconfiguring) a particular view for rendering, may interact with avehicle occupant (e.g., via two-way audio), and/or may issue commandssteering a viewport, soundport, or sensorport (e.g., whether renderingis performed at the ego-object, on the device operated by the remoteuser, or at some other location). In some embodiments, a vehicleoccupant may stream to a friend or remote operator at a remote location(e.g., using a mobile app, a monitor visible to a vehicle operated bythe friend, an AR or VR headset). In some embodiments, multiple vehiclesin a fleet may stream to a location where fleet monitoring occurs. In anautonomous or semi-autonomous driving application, a user may operate amobile app on his or her mobile device, where the application isconfigured to communicate with his or her vehicle to initiate autonomousor semi-autonomous maneuvering (e.g., self-parking, summoning of thevehicle, traveling along designated waypoints or a traced path, etc.),and video or some other sensor data may be streamed to the mobile app sothe user can monitor the vehicle while it is self-maneuvering. Inanother application, if there is a failure in autonomous maneuvering(e.g., a slow-down, stoppage in a lane) or a catastrophic health event,the vehicle may connect to an emergency operator and transmit a videofeed or other sensor data, enabling the emergency operator to assess thesituation and potentially take control and navigate the vehicle tosafety.

In another example application, a 3D (e.g., AR or VR) representation ofa physical (e.g., driving or navigational) environment may be renderedwith a digital twin or other virtual representation of one or moreego-objects in the physical environment. For example, position and/ororientation of any number of ego-objects (e.g., different vehicles) inthe physical environment may be sensed, streamed, and/or used to updatethe position and/or orientation of their corresponding virtualrepresentations in an AR/VR representation of the physical environment,substantially in real-time. In some embodiments, the AR/VRrepresentation of the physical environment may be rendered with arepresentation of sensor data (e.g., image data, LiDAR data, RADAR data,temperature, audio) detected and streamed by one or more objects (e.g.,ego-objects) in the physical environment. As such, an AR or VRrepresentation of the physical environment may be rendered and presentedin a headset worn by a remote user to provide an immersive experiencerepresenting the physical environment through which an ego-object istraveling.

As such, the techniques described herein may be used to capture, stream,render, and/or present a representation of a physical environmenttraversed by an ego-object to facilitate various remote experiences,from sharing a surround view rendering with your friend, to remote fleetmonitoring, to rendering an immersive AR/VR representation of thephysical environment, and other applications.

Example Surround View System

With reference to FIG. 1 , FIG. 1 is an example Surround View System100, in accordance with some embodiments of the present disclosure. Itshould be understood that this and other arrangements described hereinare set forth only as examples. Other arrangements and elements (e.g.,machines, interfaces, functions, orders, groupings of functions, etc.)may be used in addition to or instead of those shown, and some elementsmay be omitted altogether. Further, many of the elements describedherein are functional entities that may be implemented as discrete ordistributed components or in conjunction with other components, and inany suitable combination and location. Various functions describedherein as being performed by entities may be carried out by hardware,firmware, and/or software. For instance, various functions may becarried out by a processor executing instructions stored in memory. Insome embodiments, the systems, methods, and processes described hereinmay be executed using similar components, features, and/or functionalityto those of example autonomous vehicle 3500 of FIGS. 35A-35D, examplecomputing device 3600 of FIG. 36 , and/or example data center 3700 ofFIG. 37 .

As a high level overview, the Surround View System 100 may beincorporated into an ego-object, such as the autonomous vehicle 3500 ofFIGS. 35A-35D. The Surround View System 100 may include any number andtype of sensor(s) 101 such one or more cameras that capture sensor data(e.g., image data 105) representing the surrounding environment. TheSurround View System 100 may use the image data 105 to generate asurround view visualization representing the surrounding environment,and/or present it on a display 190 visible to an occupant or operator ofthe ego-object (e.g., a driver or passenger). Additionally oralternatively, a representation streaming engine 195 may stream thesurround view visualization, the sensor data, and/or some otherrepresentation of the environment in and/or around the ego-object to aremote location.

In an example embodiment, an ego-object (e.g., the autonomous vehicle3500 of FIGS. 35A-35D) is equipped with any number and type of sensor(s)101 (e.g., one or more cameras, such as fisheye cameras), and thesensor(s) 101 may be used to capture frames of overlapping sensor data(e.g., overlapping image data) for each time slice. Generally, anysuitable sensor may be used, such as one or more of the stereo camera(s)3568, wide-view camera(s) 3570 (e.g., fisheye cameras), infraredcamera(s) 3572, surround camera(s) 3574 (e.g., 360° cameras), and/orlong-range and/or mid-range camera(s) 3598, of the vehicle 3500 of FIG.35A. Typically, different sensors have their own 3D coordinate systems.As such, some embodiments align sensor data from the sensor(s) 101(e.g., image data) in a coordinate system defined relative to theego-object, such as a vehicle rig coordinate system, as described inmore detail with respect to FIG. 25 . Additionally or alternatively, theenvironment surrounding the ego-object may be modeled in a global 3Dcoordinate system (world space), and the sensor data may be aligned inthe global 3D coordinate system. In an example configuration, fourfisheye cameras are installed at the front, left, rear and right side ofa vehicle, where surrounding videos are continuously captured.Ego-motion of the vehicle may be generated using any known technique andsynchronized with timestamps of the frames (e.g., images) of the videos.For example, absolute or relative ego-motion data (e.g., location,orientation, positional and rotational velocity, positional androtational acceleration) may be determined using a vehicle speed sensor,gyroscope, accelerator, inertial measurement unit (IMU), and/or others.

In some embodiments, the Surround View System 100 includes a dewarpmodule 110, a stitching module 120, an under vehicle reconstructor 130,an adaptive 3D bowl generator 150, a projection module 175, a viewgenerator 180, and/or a representation streaming engine 195. Taking anexample implementation in which the sensor(s) are camera(s) (e.g.,fisheye cameras) that capture image data 105 (e.g., fisheye images), thedewarp module 110 may remove distortion (e.g., barrel distortion, radialdistortion) from the image data 105 using any known technique. Thestitching module 120 may align and stitch the resulting image data intoa stitched image, as explained in more detail with respect to FIGS. 2-12. The under vehicle reconstructor 130 may use cached sensor data fromthe sensor(s) 101 (e.g., the image data 105, dewarped image data) andego-motion of the ego-object to reconstruct the area under theego-object, as explained in more detail with respect to FIGS. 23-31 . Assuch, the under vehicle reconstructor 130 may provide the resultingreconstruction to the stitching module 120 for inclusion in a stitchedimage.

In some embodiments, the Surround View System 100 uses a 3D model of thesurrounding environment such as a 3D bowl to generate a surround viewvisualization. In some embodiments, the 3D bowl has an adaptive shapethat depends on distance(s) and/or direction(s) to detected object(s).For example, the adaptive 3D bowl generator 150 may use sensor data fromthe one or more sensor(s) to generate an adaptive 3D bowl 170 thatmodels the environment with a shape that depends on distance and/ordirection to detected objects in the environment, as explained in moredetail with respect to FIGS. 13-22 . For example, the adaptive 3D bowlgenerator 150 may include a depth estimator 155 that estimates distanceand/or direction to detected object(s) in the environment (e.g., using3D objection detection, using a projection of sensor data onto atop-down 2D occupancy grid), and a 3D bowl adapter 160 may fit a shapefor the adaptive 3D bowl 170 (e.g., by deforming an initial 3D bowl 165)based on the distance(s) and/or direction(s) to detected object(s).

The projection module 175 may project the stitched image generated bythe stitching module 120 to generate a projection image (e.g., atop-down projection image) using estimated depth values (e.g., generatedby the depth estimator 155), depth values sampled from a fixed 3D bowl,depth values sampled from the adaptive 3D bowl 170, and/or otherwise.Additionally or alternatively, the projection module 175 may map thestitched image generated by the stitching module 120 (or correspondingimage data 105 or dewarped image data) onto a fixed 3D bowl, theadaptive 3D bowl 170, or some other 3D representation of the surroundingenvironment to generate a textured 3D model of the environment (e.g., atextured 3D bowl).

As such, the view generator 180 may output or render a view of one ormore of the foregoing, as described in more detail with respect to FIGS.13, 19, and 23 . For example, the view generator 180 may position andorient a virtual camera in a 3D scene with the textured 3D bowl andrender a view of the textured 3D bowl from the perspective of thevirtual camera through a corresponding viewport. In some embodiments,the viewport may be selected based on a driving scenario (e.g.,orienting the viewport in the direction of ego-motion), based on adetected salient event (e.g., orienting the viewport toward the detectedsalient event), based on an in-cabin command (e.g., orienting theviewport in a direction instructed by a command issued by an operator oroccupant of the ego-object), based on a remote command (orienting theviewport in a direction instructed by a remote command), and/orotherwise. As such, the view generator 180 may output a visualization(e.g., a surround view visualization) of the surrounding environment tothe display 190 (e.g., a monitor visible to an occupant or operator ofthe ego-object). Additionally or alternatively, a representationstreaming engine 195 may stream the surround view visualization, thesensor data, and/or some other representation of the environment inand/or around the ego-object to a remote location, as explained in moredetail with respect to FIGS. 32-34 .

Image Stitching

Approaches in accordance with various embodiments provide for thegeneration, editing, or manipulation of image or video data. Inparticular, various embodiments provide for the generation of compositeimage or video data, using a system and methodology for combining (e.g.,stitching) discrete image data into contiguous, composite surround orpanoramic views. This may include, for example and without limitation,stitching individual frames captured using multiple sensors in anenvironment with at least partially overlapping fields of view such thateach frame provides a larger field of view than is captured in anyindividual frame of the frames to be stitched together, such as a 180°or 360° view. At least some of the pixels may have their color valuesblended during the stitching or compositing process.

Stitched or composite image data generated using such a process may beutilized for various purposes. It should be understood that, at leastfor convenience of explanation, “image” data may be referred to herein,although image data may take many different forms, such as whereindividual images correspond to individual frames of video, or whereimage data is used to generate immersive video, augmented reality (AR),virtual reality (VR), or mixed reality (MR) experiences. In one exampleuse case, live video data may be captured for a device or system such asa robot, or (autonomous or semi-autonomous) vehicle (collectively, anego-vehicle). It may be desirable to combine the live video data frommultiple cameras to generate a consistent view of the environment, suchas a full view of the environment surrounding around that device ormachine, which may be presented to an operator, monitoring system,occupant or passenger, remote viewer, or other such entity. For many ofthese systems, it may be helpful to generate such composite videorepresentations in real time, such that an operator or monitoring systemcan take actions, if needed, based on real time observations.

Consider an autonomous or semi-autonomous vehicle that may include oneor more passengers or occupants. The vehicle may include, or beassociated with, multiple cameras (or at least one multi-part camera)that capture images of an environment in which that vehicle is located.In order to provide a seamless visualization of the environment to thepassengers, it may be desirable to stitch or composite these images intoa single representation of the environment that updates in real time,such as while the vehicle is in motion. Such a composite representationmay also be transmitted to a remote system, such as a monitoring orcontrol system that may analyze performance of the vehicle orpotentially make changes in operation of the vehicle based at least inpart upon aspects of the environment as determined from thisrepresentation. It may be desirable for these visualizations to be asrealistic and free from defects, artifacts, or distortions as possible.

FIGS. 2A-2B illustrate an example of image stitching, in accordance withsome embodiments of the present disclosure. FIG. 2A illustrates a set offour constituent images 200 that can be captured by respective camerason, or associated with, an autonomous vehicle (e.g., the exampleautonomous vehicle 3500 of FIGS. 35A-35D) or other system or device inaccordance with various embodiments. In this example, the cameras havedifferent but partially overlapping fields of view, such that the imagesmay be stitched together without gap filling or additional image datageneration, although it should be understood that there may be othersituations where cameras do not produce images with at least partiallyoverlapping views that may require such tasks. In this example, eachcamera may have various intrinsic or extrinsic values that can impactthe appearance of a captured image, where those values may relate tofield of view, optical center, focal length, or camera pose, among othersuch options.

As mentioned, it may be desirable to generate a single, consistent viewof this surrounding environment based at least in part upon thesecaptured images 200. This may include, for example, generating acomposite image 250 as illustrated in FIG. 2B. As illustrated, such acomposite image may provide a single, consistent representation of theenvironment, whether a full 360° view or at least a portion of theangular range. Such a view may be presented as a single view showing theentire image 250, or portions of the image may be displayed at differenttimes, where that view may be controllable by a user. In this example,an angular shape alpha map may be utilized instead of, for example, aperpendicular shape alpha map, at least to avoid the alpha discontinuityissue observed in various traditional stitching techniques. Blendingregion color continuity may be improved by extending the blendingregion(s) to hide a color discontinuity issue around corners or othersuch features.

FIG. 3 is a diagram illustrating an example data flow through an exampleimage stitching system 300, in accordance with some embodiments of thepresent disclosure. In this example, a set of camera images 302 may bereceived as input, and may be received from multiple cameras associatedwith a given device or system (e.g., the example autonomous vehicle 3500of FIGS. 35A-35D), or positioned with respect to a given environment,among other such options. In at least one embodiment, these imagesrepresent two or more different views of an environment that are, atmost, partially overlapping. These images may represent a full, orpartial, view of a scene, location, or environment. This image data mayinclude “live” data that is streamed or transmitted shortly after imagecapture, or may include image data captured previously and provided in amore offline fashion.

In this example, the camera images may be provided to a stitching module304 (some or all of which, may correspond to the stitching module 120 ofFIG. 1 ) that may utilize these input camera images 302 to generate anoutput view or composite image, or video stream, for presentation via atleast one display 324, such as a monitor, projector, or wearabledisplay, among other such options. In this example, the camera images302 may be provided to a stitching component 306 that will attempt tostitch multiple images together to generate a composite representation.The stitching module may use any of a number of different stitching orcompositing algorithms, and may perform various blending or other imagemanipulation techniques. In at least one embodiment, the stitchingmodule may utilize various intrinsic and extrinsic parameters of thecameras, at least to the extent values for these parameters are knownand available from a camera database 308 or other such location, inorder to properly assign the image data for compositing. This mayinclude, for example, information such as the relative poses ororientation of these different cameras, such that at least an initialstitching position and orientation may be determined for each image. Itshould be understood that for image stitching and/or view generatorsystems that receive sequences of images or streams of video frame data,the images or frames to be composited may be those that correspond to,or were captured at, the substantially same point in time, at least toan extent to which such capture can be synchronized, or otherwiserepresent the same time slice.

The camera calibration parameters may be used to map the camera viewsinto a stitched space, such as may correspond to a top-down view or“bowl” view in a projection space. This projection may be used toidentify any overlapping regions between adjacent cameras, where one ormore blending algorithms may be used to blend at least some of thepixels to make the stitching less visible or apparent. The stitchingcomponent 306 may utilize values for one or more stitching parameters,as may be stored to a parameter database 310 or other such location.Values for these parameters may determine aspects of how componentimages are stitched together, as may relate to weightings or locationsfor blending and other such aspects. Example stitching parametersinclude, but are not limited to, blending method (e.g., alpha blendingor multiband blending), blending width, blending alpha map shape (e.g.,angular based or perpendicular based), seam type (e.g., diagonal seam,vertical seam, or horizontal seam), and seam location. The stitchingcomponent 306 may utilize values for these various parameters with theinput constituent images to generate a composite image, or stitchedimage 312, that provides a single representation of an environment,scene, or location at a point in time, such as a “current” point intime, accounting for some amount of latency in transmission andprocessing.

If a scene is not harmonized between cameras, it may be desirable toutilize a larger blending weight or radius (e.g., 200) to provide for asmoother transition between data from images. If it is a highlystructured scene with lots of buildings and edges, for example, it maybe desirable to utilize a smaller blending weight (e.g., 2) to avoidghosting and other artifacts due to misalignment between cameras. Singleband blending may be utilized where images are blended in only one band(or a subset) of multiple bands. In at least one embodiment, constituentimages are decomposed into different frequency bands or components, anddifferent blending weights may be used for each of these bands orcomponents.

In order to attempt to provide stitched images of high subjectivequality, some amount of processing of a stitched image may be performedto attempt to assess the quality, as well as to use a result of thatassessment to make any changes to the stitching parameters that may bedesirable to improve the quality, at least where the determined qualityis below a target or threshold value or determination. In this example,both the stitched image 312 from the stitching component 306, and theconstituent images 314 used to generate that stitched image, may beutilized for a quality assessment determination. In some embodiments theconstituent images may correspond directly to the input camera images302, while in some embodiments these constituent images may have had atleast some amount of processing performed, such as to reduce variationsin brightness, color, or contrast, or to reduce a presence of noise orremove image artifacts, among other such options. In at least oneembodiment, removing or reducing a presence of image artifacts in theindividual constituent images before stitching may result in a higherquality stitched image.

In this example, the stitched image 312 and constituent images 314 maybe processed to produce image data that better provides for one or morespecific types of comparison. In at least one embodiment, this mayinclude utilising a high pass filter 316 (or edge or feature detector)on the images to enhance or identify edges or other prominent featuresin the images. In some embodiments, a quality assessment 322 may beperformed using any known technique, and if a measure of the qualityassessment is below some designated threshold, the image data may beoptimized in a loop, for example, by applying one or more geometric orphotometric transforms and feeding the transformed image data back tothe stitching component 306.

Dynamic Seam Placement and State-Dependent Image Stitching

One or more embodiments are directed to techniques for robust dynamicstitching of multiple images. These techniques maybe implemented in aSurround View System, such as one for a vehicle's cockpit, a vehicleoccupant, or robotic remote operator, as example non-limitingembodiments. Existing systems move stitching seams to avoid cuttingobjects, where objects are detected by an Ultrasound Sensor (USS).Unfortunately, the typical usable distance of commercial USS systems isvery short, thus only very close objects can be detected. Furthermore,the existing solution treats different viewports the same, and theviewport is selected manually. However, vehicle or robotic operators oroccupants may have different preferences on where the seam locationshould be under different viewports, and automatically selectingviewports which best fit the underlying driving scenario is preferred.

As such, one or more embodiments are directed to a system that can beused with longer distance ranges and larger speed ranges relative toconventional systems, and provides improved visualization quality viaviewport-dependent dynamic seam movement and automatically selectingviewports. For example, one or more embodiments provide a dynamic seammovement system that incorporates object detection, ego-vehicle speed,ego-vehicle moving direction, active viewport, driver gaze (e.g., a gazeheatmap), and/or visual saliency (e.g., a saliency mask) into a unifiedstate machine that determines whether to use a default seam placement ordynamic seam placement that avoids salient regions or objects to achievea better quality stitched view and automatic view selection. In someembodiments, dynamic seam placement may be enabled or disabled based onspeed of ego-motion, direction of ego-motion, proximity to salientobjects, active viewport, driver gaze, and/or other factors.

For example, according to one or more embodiments, a (e.g., visual)saliency mask may be computed to represent (and place higher weightingon) region(s) where drivers typically pay more attention to. Forexample, driver gaze may be monitored (e.g., using a camera or drivermonitoring system), the region of the environment where the driver islooking may be determined from detected and projected into acorresponding region of a saliency mask, and the corresponding region(e.g., pixels corresponding to the driver's field of view, pixels withina bounding shape of a detected object the driver is looking at) may beupdated to represent a measure of saliency. In this manner, a saliencymask that represents driver gaze may be used to steer seams away fromregions where a driver is looking (e.g., when dynamic seam placement isenabled).

In embodiments, seams are placed to avoid cutting salient objects (e.g.,certain detected objects) or salient regions (e.g., where a driver islooking based a detected direction of gaze). FIG. 4A shows an examplesurround view visualization 400 with a seam 404 placed vertically toavoid cutting a salient object or region (e.g., detected surroundingcars, regions where a driver is looking), in accordance with someembodiments of the present disclosure. When parking in between two cars,a driver may continuously check the distance from the ego-vehicle toneighboring or proximate cars and other objects to avoid collision.Accordingly, a saliency mask according to one or more embodiments mayassign higher values or weights on the two cars, and the saliency maskmay be used as input in dynamic seam placement. For example, in one ormore embodiments, a position of one or more seams is adjusted to avoidregions assigned higher weights in a saliency mask. In FIG. 4A,depicting an example scenario, the two vehicles on each lateral/profileside of the center ego-vehicle may be assigned higher weights in asaliency mask, and a determination may be made to use a vertical seam(seam 404) to avoid stitching in portions of the underlying images thatinclude or intersect the neighboring vehicles, thereby reducingdistortion of the vehicles in a display or visualization.

In contrast, when driving on an open road, the driver may look morefrequently in the front view rather than on the lateral sides.Accordingly, a saliency mask during this period of operation may assignhigher weights to pixels corresponding to the front view and lowerweights to pixels corresponding to the lateral sides. In this scenario,placement of one or more seams may be dynamically determined or adjustedto a horizontal or substantially horizontal position, placing a seam ina region corresponding to the lateral sides of the ego-vehicle to avoidthe higher weighted saliency region in front of the ego-vehicle.

By way of illustration, assume a driver is looking to the left or rightwhile driving a vehicle. In some embodiments, a seam may be placedvertically to avoid placing a seam in a region of a surround viewvisualization corresponding to the left or right of the vehicle, as inFIG. 4B, which shows an example surround view visualization 410 with aseam 415 extending vertically along the left surface of a vehicle. Nowassume the driver is looking forward while driving the vehicle. In someembodiments, a seam may be placed horizontally to avoid placing a seamin a region corresponding to the front of the vehicle, as in FIG. 4C,which shows an example surround view visualization 420 with a seam 425extending horizontally from the side surfaces of a vehicle to avoidcreating artifacts in front of the vehicle.

In one or more embodiments, a saliency mask may be generated,determined, or obtained using a combination of object detectioninformation, surrounding object distance information, ego-vehicledriving direction information, ego-vehicle speed information, currentviewport information (e.g., direction of an active viewport), and/orother factors. For example, drivers typically pay most attention to thefront when driving forward in an open space, so a default viewportscenario may place a viewport facing forward. As a result, a saliencymask may allocate higher weights/weighting in the center of the forwardfacing view based on the viewport facing forward, or the ego-vehicledriving forward, and a seam may be placed horizontally to present abetter (less disrupted) forward facing view to the driver. However, whendriving at low speeds and/or whenever an object (e.g., another vehicle)passes by closely (e.g., within a threshold distance), the driver mayneed to pay closer attention to the object passing by. As a result, inthis case, a saliency mask may allocate higher weights to a region orpixels corresponding to the passing object, and a seam may be placed orrelocated to avoid cutting through (intersecting or bisecting) thepassing object. In another example, when navigating backward, the driveror operator may need to pay more attention to the rear of theego-vehicle. As such, in this case, a seam may be placed (e.g.,horizontally) to avoid cutting through objects behind the ego-vehicle.FIG. 4D is an example surround view visualization 430 updated with abackwards facing viewport 440 based on a vehicle traveling in reverse orthe viewport 440 facing backwards.

In one or more embodiments, a saliency mask may be generated,determined, or obtained by training a machine learning model to predicta saliency mask under various viewports. In one or more embodiments, adynamic seam location may be selected based on viewport information,with the location(s) of a seam(s) being dynamically determined oradjusted to avoid or minimize intersecting or bisecting salient objects.In embodiments, a salient object may be identified by any or acombination of: an object's distance to the ego-vehicle, a direction ofmovement of the ego-vehicle, or a driver's gaze direction (e.g., whenthe ego-vehicle drives forward, salient objects are determined to thefront of the ego-vehicle, while when reversing, salient objects aredetermined to the back). As such, dynamically placing seams to avoidbeing placed in a saliency region avoids obvious visual artifactsappearing in a driver or operator's likely region or viewport ofinterest.

In one or more embodiments, a dynamic seam state machine dynamicallychanges under different viewports to correspond with differentviewports. In one or more embodiments, a viewport can be dynamicallychanged based on driver's gaze information and/or direction ofego-motion. For example, when a driver drives along an alley, or if thedriver continuously checks the surrounding cars to avoid collision whenparking into a parking spot, then the viewport will be automaticallyselected or changed towards the driver's detected gaze direction orbased on the ego-motion information to give the driver a bettervisualization. As such, in some embodiments, when driving or operatingthe ego-vehicle forward for a parking maneuver, the viewport may bechanged to a forward-looking viewport; whereas when reversing during aparking maneuver, the viewport may be changed or selected to abackward-looking viewport.

In one or more embodiments, a dynamic seam location may be optimized toachieve high visual quality, e.g., to place seam on lower weightsaliency region, and salient regions may be determined and/or weightedbased on a designated or active viewport (e.g., which may be selectedbased on driving direction or scenario, driver gaze, etc.). FIG. 4Eillustrates example surround view visualizations 450 a-f with seamsplaced to avoid salient regions in various viewports. In FIG. 4E,surround view visualizations 450 a-d represent side viewports andsurround view visualizations 450 e-f represent top-down viewports. Insome embodiments, a designated (e.g., active) viewport may be assigned aviewport cost map, such as the one illustrated in FIG. 4F, which assignsa measure of saliency (e.g., higher values) to pixels in the center ofthe viewport and assigns a measure of non-saliency (e.g., lower values)to pixels toward the edge or boundary of the viewport. In someembodiments, the viewport cost map (or corresponding projected pixels)may be updated to represent a different region of saliency, for example,based on a determination that the driver's gaze is directed toward adifferent portion of the viewport. Additionally or alternatively,salient regions represented by the viewport cost map may be combinedwith other salient regions (e.g., representing detected objects, regionsin the environment where the driver's gaze is directed) and the salientregions may be represented (e.g., by projecting corresponding pixelsinto) in a common saliency mask. Using and/or incorporating a viewportcost map may serve to encourage shorter seam lengths, since each seammay have endpoints that begin at a representation of the ego-vehicle inthe center of the viewport (e.g., a black box, a computer graphic 3Dmodel of the ego-vehicle, an under vehicle reconstruction) and terminateat the edge of the viewport. As a result, an optimal seam location maybe dynamically determined or adjusted to avoid or minimize intersectingor bisecting salient regions, which may effectively identify theshortest candidate seam that crosses the fewest (e.g., projected) pixelsof the viewport cost map and/or that encourages placing a seam towardsthe boundary of a viewport's field of view.

In in the surround view visualizations 450 a-f of FIG. 4E, thenon-optimized candidate seams 460 a-f represent candidate seams thatcross salient regions, and the optimized seams 470 a-f representoptimized seams that avoid or minimize crossing salient regions. Forexample, in the surround view visualizations 450 a-d and the upper halfof the surround view visualization 450 e, the non-optimized candidateseams 460 a-f cross salient objects, so the corresponding optimizedseams 470 a-e have been selected to avoid crossing those salientobjects. In the surround view visualization 450 f, the non-optimizedcandidate seams 460 f are longer than the optimized candidate seams 470f. As a result, using a corresponding measure of saliency in a costfunction serves to assign the non-optimized candidate seams 460 f ahigher cost than the optimized candidate seams 470 f because thenon-optimized candidate seams 460 f cross more salient pixels.

FIG. 5 is an example dynamic seam stitching module 500, in accordancewith some embodiments of the present disclosure. The dynamic seamstitching module 500 may correspond to the stitching module 120 of FIG.1 or the stitching module 304 of FIG. 3 . In this example, the dynamicseam stitching module 500 includes an alignment component 510, a seamplacement component 520, and a blending component 590. At a high level,the alignment component 510 may receive frames of image data (e.g.,images) representing overlapping views of an environment, and usecorresponding camera parameters to align the frames and create analigned composite image (e.g., a panorama, a 360° image) with regions ofoverlapping image data. The seam placement component 520 may determinewhere in each overlapping region to place a seam, and the blendingcomponent 590 may use any suitable blending technique to blend theoverlapping image data at each seam and create a stitched image (e.g., apanorama or 360° image).

In the embodiment illustrated in FIG. 5 , the seam placement component520 includes a dynamic seam toggling state machine 530 that determineswhether to use a default seam placement or a dynamic one (e.g., based onobject saliency). When using dynamic seam placement, an object detector540 may perform or access results of object detection and generate oneor more object and/or saliency masks for each frame. A projectioncomponent 550 may project each object and/or saliency mask onto thecomposite image to overlay each overlapping region of image data withtwo (or more) different object and/or saliency masks representingobjects or other salient regions detected by multiple sensors. As such,a seam steering component 560 may determine a seam placement for eachoverlapping region (e.g., using a seam update state machine 570) toavoid (or attempt to avoid) placing a seam on salient regionsrepresented in both overlaid masks. In some embodiments, a temporalfilter may be applied over a temporal window (e.g., 30 frames of databuffered first-in-first-out) to signals used to determine dynamic seamlocations (e.g., distances to detected objects) and/or determined seamlocations to stabilize the seam over time and reduce or minimize jumpsin seams from frame-to-frame.

Turning initially to the dynamic seam toggling state machine 530 of theseam placement component 520, the dynamic seam toggling state machine530 may determine whether to use a default seam placement or dynamicseam placement that avoids salient objects or regions. FIG. 6 is anexample dynamic seam placement state machine 600 that includes a dynamicseam toggling state machine 601, which may correspond to the dynamicseam toggling state machine 530 of FIG. 5 . The dynamic seam placementstate machine 600 of FIG. 6 implements an example decision tree thatdetermines whether to use a default seam placement or dynamic seamplacement that avoids salient objects or regions, and to enable anddisable dynamic seam placement based on speed of ego-motion of anego-vehicle, direction of ego-motion, proximity to salient objects,active viewport direction, driver gaze, and/or other factors. Forexample, for each time slice and a set of frames being stitched for thattime slice, the dynamic seam placement state machine 600 may determinewhere to place a seam in each region of overlapping image data in analigned composite image generated from the set of frames.

The dynamic seam placement state machine 600 of FIG. 6 includes adynamic seam toggling state machine 601 (e.g., corresponding to thedynamic seam toggling state machine 530 of FIG. 5 ) that toggles betweendynamic and default seam placements, and a seam update state machine 651(e.g., corresponding to the seam update state machine 570 of FIG. 5 )that determines a seam placement that avoids (or attempts to avoid)placing a seam on salient regions and/or gradually moves a seam overtime.

The dynamic seam toggling state machine 601 includes a forward viewportor forward driving scenario 605 that is activated based on adetermination that an ego-vehicle is traveling in a substantiallyforward direction (e.g., based on a signal representing the direction ofego-motion), or based on the viewport of a virtual camera facingforward. In this scenario, at block 610, ego-speed is determined (e.g.,from a speed sensor). If ego-speed is below a lower speed threshold(e.g., less than 5 km/hr), at block 615, the dynamic seam toggling statemachine determines the distance to the closest object (e.g., detected orsalient object), as explained in more detail below. If the distance tothe closest object is less than some threshold proximity (e.g., lessthan 3 m), the dynamic seam toggling state machine 601 enables dynamicseam placement at block 620. If the distance to the closest object ismore than the threshold proximity, the dynamic seam toggling statemachine 601 disables dynamic seam placement at block 625 and uses adefault seam placement instead (e.g., horizontal or other defaultvalues, such as one that minimizes seam length visible in a particularviewport). Returning to block 610, when ego-speed is within some mediumspeed range (e.g., from 5-16 km/hr), the dynamic seam toggling statemachine 601 disables dynamic seam placement in favor of a default seam(e.g., a horizontal seam).

Returning to block 615, distances to surrounding objects may be obtainedor determined in various ways. In some embodiments, 3D object detectionis performed (e.g., by processing sensor data) and/or a representationof detected 3D objects (e.g., 3D cuboids in rig coordinates) isaccessed. For example, distance to objects may be computed using depthor stereo camera arrays; alternatively, 2D bounding boxes and/or 3Dcuboids may be predicted from image data (e.g., fisheye images), LiDARor RADAR detections, and/or other sensor data using one or more machinelearning models. As such, distances between the vehicle (e.g., thevehicle center) and the detected objects may be computed from predictedand/or sensed locations. Additionally or alternatively, sensor data suchas a LiDAR or RADAR point cloud representing detected objects may beprojected into a 2D top-down occupancy grid, and distances to detectedobjects may be computed from the 2D occupancy grid. In some embodiments,some other ranging technique may be additionally such alternativelyapplied, for example, by feeding image data into a deep neural networktrained on LiDAR data to predict a depth map representing distances todetected objects. These are just a few examples, and other rangingtechniques may additionally or alternatively be applied.

The dynamic seamtoggling state machine 601 also includes a scenario 630for other viewports or driving directions besides forward facing orforward moving. In this example, scenario 630 may be activated based ona determination that an ego-vehicle is traveling in some other directionbesides substantially forward (e.g., based on a signal representing thedirection of ego-motion), or based on the viewport of a virtual camerafacing some other direction besides forward. In this scenario, at block635, ego-speed is determined (e.g., from a speed sensor). If ego-speedis below a lower speed threshold (e.g., less than 5 km/hr), the dynamicseam toggling state machine 601 enables dynamic seam placement at block640. If ego-speed is within some medium speed range (e.g., from 5-16km/hr), the dynamic seam toggling state machine 601 enables dynamic seamplacement at block 645.

The example illustrated in FIG. 6 shows an example scenario in whichblocks 620 and 640 trigger the seam update state machine 651 to activatea close object priority scenario 655, and block 645 triggers the seamupdate state machine 651 to activate a gradual move scenario 675, butthis is just an example. Generally, the dynamic seam toggling statemachine 601 may trigger any dynamic seam placement technique, includingbut not limited to the close object priority scenario 655, the gradualmove scenario 675, techniques implemented by the seam placementcomponent 520 of FIG. 5 , techniques involved in the seam steering 760of FIG. 7A, and/or others.

For example, returning briefly to FIG. 5 , in some embodiments, thedynamic seam toggling state machine 530 of the seam placement component520 triggers dynamic seam placement using the object detector 540, theprojection component 550, and the seam steering component 560. FIG. 7Ais a diagram illustrating an example dynamic seam placement technique700 using corresponding steps for object detection 740, projection 750,and seem steering 760.

The example dynamic seam placement technique 700 operates on inputimages 705 a-d, which in this example, are fisheye images representingfront, right, rear, and left fields of view with respect to anego-vehicle (e.g., a car), respectively.

In some embodiments, the input images 705 a-d are processed (e.g., bythe object detector 540 of FIG. 5 ) using object detection 740 togenerate corresponding masks 745 a-d (e.g., object or saliency masks).Generally, object detection 740 may be performed using 2D objectdetection (e.g., applied on the input images 705 a-d) or 3D objectdetection (e.g., from the images 705 a-d, a 3D point cloud ofcorresponding LiDAR or RADAR detections). In some embodiments, salientobjects may be identified from detected objects, for example, based on adetected object class (e.g., emergency vehicles, road users, movingobjects), driving scenario (e.g., surrounding vehicles when parkingbetween the vehicles, any close object when parallel parking), distanceto the vehicle, and/or other factors. In some embodiments, detectedobjects beyond some threshold distance to the ego-vehicle (e.g., 3 m, 7m, 15 m, etc.) may be filtered out or ignored to free up computationalresources. Depending on the implementation, objects deemed to benon-salient may be filtered out or ignored, and/or salient objects maybe assigned a higher priority (or weight) than non-salient objects inthe masks 745 a-d. In some embodiments, object detection 740 maygenerate a binary object mask representing whether or not each point(e.g., each pixel in a 2D mask corresponding to each image) correspondsto a depicted part of a detected (salient) object represented by eachinput image 705 a-d, and the binary object masks for the input image 705a-d may be used as the masks 745 a-d. In some embodiments, a binaryobject mask may be used to derive a weighted saliency mask representinga measure of saliency of each point (e.g., by weighting the binaryobject mask based on proximity of a corresponding detected object,prioritizing certain classifications of detected object, using logicthat depends on driving scenario), and the weighted saliency masks forthe input image 705 a-d may be used as the masks 745 a-d. In someembodiments, a machine-learning model may be trained to predict aweighted saliency mask using training data labeled with a measure ofground truth saliency (e.g., objects or regions deemed to be importantby human labelers). In some cases, driver gaze may be monitored, theregion of the environment where the driver is looking may be projectedinto a corresponding region of the masks 745 a-d, and the correspondingregion (e.g., corresponding pixels) may be updated in the masks 745 a-dto represent a measure of saliency. These are just a few examples, andother techniques that quantify saliency are contemplated within thescope of the present disclosure.

Projection 750 may be performed (e.g., by projection component 550 ofFIG. 5 ) to project the masks 745 a-d onto a 2D or 3D representation ofthe environment surrounding the ego-machine (e.g., a top-downrepresentation such as a representation of the ground) using cameracalibration parameters representing each corresponding camera's positionand orientation in the environment to generate corresponding maskprojections 755 a-d that are aligned with the environment. For example,the projection component 550 may perform a top-down projection toproject each of the masks 745 a-d onto a ground plane of the environmentto generate corresponding mask projections 755 a-d. In this example,since the input images 705 a-d had overlapping fields of view (e.g., thesame nearby vehicle is visible in input images 705 a and 705 d), thecorresponding masks 745 a-d have overlapping fields of view (e.g., masks745 a and 745 d have corresponding pixels that indicate the presence ofthe nearby vehicle), and the corresponding mask projections 755 a-d haveoverlapping fields of view (e.g., mask projections 755 a and 755 d havecorresponding pixels that indicate the presence of the nearby vehicle).

Generally, since the input images 705 a-d represent differentperspectives, the same object may be represented in different imageswith different shapes and sizes. As a result, an overlapping regionbetween two of the mask projections 755 a-d may have differentrepresentations of the same object that do not align perfectly.Accordingly, some embodiment may consider data from multiple overlappingmask projections when determining where to place a seam for a stitchedimage 770.

More specifically, overlapping regions between the mask projections 755a-d may be determined. For example, projection 750 may includeidentifying the portion of two or more of the mask projections 755 a-dthat lands in an overlapping region between the projections 755 a-d. Themask projections 755 a-d may be represented as corresponding 2D (e.g.,top-down) or 3D representations of the environment that at leastpartially overlap with one another, forming a 2D or 3D overlappingregion in which object or saliency data from the masks 745 a-d projectsonto the same point (e.g., pixel, voxel). Taking an overlapping regionin a 2D top-down representation of the surrounding environment as anexample, any given pixel in the overlapping region may representdifferent values for different mask projections 755 a-d. In the exampleillustrated in FIG. 7A which involves four input images 705 a-d,projection 750 may serve to identify four overlapping regionscorresponding to the front-left, front-right, rear-right, and rear-leftof the ego-vehicle, with each overlapping region including twooverlapping mask projections. FIG. 7B is an enlarged view of theoverlapping mask projections 762 a and 762 b, 764 a and 764 b, 766 a and766 b, 768 a and 768 b, respectively.

Each set of overlapping mask projections in an overlapping regionrepresents an overlapping field of view represented by correspondinginput images 705 a-d. In order to stitch the corresponding input images705 a-d, seam steering 760 may be performed (e.g., by seam steeringcomponent 560 of FIG. 5 ) to determine a seam placement for eachoverlapping region (e g., seams 782 a and 782 b, 784 a and 784 b, 786 aand 786 b, 788 a and 788 b of FIG. 7B). Based on correspondences betweeninput images 705 a-d, an aligned representation thereof (e.g., acomposite image such as a panorama or 360° image, a 2D or 3D top-downprojection), and the overlapping mask projections 762 a-768 b, theoverlapping mask projections 762 a-768 b may be thought of as beingoverlaid on or projected onto the composite image or surface. As such,the overlapping mask projections 762 a-768 b may share a coordinatesystem with the composite image or surface, and as a result, a seamplacement determined with respect to the overlapping mask projections762 a-768 b may implicitly determine a seam placement in the samecoordinate system as the image data to be stitched. Generally, seamplacement may be determined at any stage and any coordinate system andmay be translated, (un)projected, or otherwise transferred to acorresponding stage of an image stitching pipeline. For the purposes ofillustration, the example represented by FIGS. 7A and 7B involvesprojecting the input images 705 a-d and the masks 745 a-d intocorresponding 2D top-down representations with 2D overlapping regions,and a seam is placed in (or on a boundary of) each 2D overlappingregion.

Seam steering 760 may place (or attempt to place) a seam in eachoverlapping region that avoids (or minimizes) crossing (salient) objectsor regions represented in each of the overlapping mask projections 762a-768 b. In some cases, a seam may be steered through each overlappingregion to avoid or minimize cutting through or intersecting object orsalient pixels represented in the overlapping mask projections 762 a-768b. Seams of any shape may be used, whether straight, curved, segmented,or otherwise. Any suitable technique may be used to iterate or scanthrough possible seam placements, and each candidate seam placement maybe evaluated to determine whether and/or to what extent the candidateseam intersects object or salient pixels (e.g., a binary indication,count of intersected pixels, a sum of intersected pixels weighted bysaliency).

FIG. 7B illustrates an example involving a straight seam and a substantially rectangular overlapping region between the overlapping mask projections 762 a and 762 b. One end of a seam may be placed at one end point in the region, such as a corner representing the closest portion of the environment to the ego-vehicle (e.g., corners 770 a, 770 b), and a straight line may be drawn from that end point to another end point at an opposite portion of the region. For example, starting at the corners 770 a, 770 b, a ray may be projected out at an angle 775, a seam space may be scanned over different values of the angle 775 to cover the overlapping mask projections 762 a and 762 b (e.g., from 0-90°), and each corresponding candidate seam (e.g., seam 782) may be traversed through the pixels of the region and evaluated. In the example illustrated in FIGS. 7A and 7B, seam steering 760 identified seams 782 a and 782 b, 784 a and 784 b, 786 a and 786 b, and 788 a and 788 b that avoid or minimize intersecting object or salient pixels.
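
By way of a non-limiting illustration only, the following sketch shows one way the angle scan described above could be implemented for a single overlapping region. It assumes a 2D saliency array for the region and hypothetical helper names (rasterize_seam, scan_straight_seams); it is not the claimed implementation.

    import numpy as np

    def rasterize_seam(corner, angle_deg, shape):
        # Integer (row, col) pixels along a straight ray from `corner` at `angle_deg`,
        # clipped to the overlapping region of the given shape.
        h, w = shape
        t = np.arange(int(np.hypot(h, w)))
        rows = np.clip((corner[0] + t * np.sin(np.radians(angle_deg))).astype(int), 0, h - 1)
        cols = np.clip((corner[1] + t * np.cos(np.radians(angle_deg))).astype(int), 0, w - 1)
        return rows, cols

    def scan_straight_seams(saliency, corner=(0, 0), angles=range(0, 91)):
        # Scan candidate straight seams anchored at `corner` over the angle range and
        # return the angle whose seam crosses the fewest (or lowest-weighted) salient pixels.
        best_angle, best_cost = None, float("inf")
        for angle in angles:
            rows, cols = rasterize_seam(corner, angle, saliency.shape)
            cost = float(saliency[rows, cols].sum())
            if cost < best_cost:
                best_angle, best_cost = angle, cost
        return best_angle, best_cost

With a binary mask, the cost reduces to a simple crossing count; with a weighted saliency mask, it corresponds to the weighted sum described above.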

In some situations, scanning a seam space and evaluating candidate seams serves to identify multiple candidate seams (e.g., whether straight or some other shape) that avoid object or salient pixels. If there are multiple candidate seams, seam steering 760 may select one that minimizes the distance to the seam placement from the previous time slice, maximizes separation from salient or object pixels, or otherwise.

In some situations, there may be no candidate seam that avoids object or salient pixels. In some such situations, seam steering 760 may select a default seam placement (e.g., vertical, horizontal, diagonal) or a seam placement from the previous time slice. Additionally or alternatively, seam steering 760 may include evaluating candidate seams to identify a seam that minimizes the number of intersected object or salient pixels, that minimizes an energy function (e.g., having a cost that penalizes candidate seams that intersect or cover object or salient pixels, is weighted by saliency or proximity, or is based on multiple images from different time slices to promote temporal stability), and/or otherwise. If there are multiple candidate seams that would satisfy the designated criteria (e.g., that have the same cost or number of intersected object or salient pixels), seam steering 760 may include selecting a seam location that minimizes the distance to the seam placement from the previous time slice. In some embodiments, costs for different seam shapes may be evaluated using a connected components analysis and/or using dynamic programming (e.g., based on designated permissible directions in which a seam may traverse to adjacent diagonal or edge pixels). In some embodiments, edge detection is performed (e.g., on image data in the overlapping region), and graph cuts and/or seam carving may be applied to carve out high frequency content (e.g., detected edges) of the overlapping region from potential solutions, such that candidate seams cannot travel through the graph cut. As a result, high frequency content like the edges of vehicles may be effectively designated as salient to discourage placing seams over detected edges, which may represent visually important information.
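
As a non-limiting sketch of the dynamic-programming variant mentioned above, a seam-carving-style pass can find a top-to-bottom path through a per-pixel cost map (e.g., saliency plus an edge-detection penalty); the cost construction and the function name (dp_vertical_seam) are illustrative assumptions.

    import numpy as np

    def dp_vertical_seam(cost):
        # Accumulate minimal path costs row by row, allowing moves to the adjacent
        # or diagonal pixel in the next row, then backtrack the cheapest seam.
        h, w = cost.shape
        acc = cost.astype(float).copy()
        back = np.zeros((h, w), dtype=int)
        for r in range(1, h):
            for c in range(w):
                lo, hi = max(0, c - 1), min(w, c + 2)
                j = int(np.argmin(acc[r - 1, lo:hi])) + lo
                back[r, c] = j
                acc[r, c] = cost[r, c] + acc[r - 1, j]
        seam = np.zeros(h, dtype=int)
        seam[-1] = int(np.argmin(acc[-1]))
        for r in range(h - 2, -1, -1):
            seam[r] = back[r + 1, seam[r + 1]]
        return seam  # seam[r] is the seam's column in row r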

By way of non-limiting example, and returning to FIG. 6, in some cases, seam steering 760 may implement the close object priority scenario 655 of the seam update state machine 651 of FIG. 6. In this example, at block 660, the closest detected object to the ego-vehicle is identified (e.g., using any of the ranging techniques described herein), and a determination is made whether it is possible to place a seam that avoids the closest detected object (e.g., using any of the techniques described herein). In some embodiments, the identification of the closest detected object is represented in a corresponding object or saliency mask, and the determination of whether there is a seam that avoids the closest detected object uses the seam scanning techniques described herein. If it is possible to avoid the closest detected object, at block 665, a seam placement that avoids the closest object is selected (e.g., using any of the techniques described herein). If it is not possible, at block 670, a default seam placement is used (e.g., horizontal, vertical, diagonal, or the default seam placement for the applicable scenario 605 or 630).

In another non-limiting example, in some cases, seam steering 760 may implement the gradual move scenario 675 of the seam update state machine 651 of FIG. 6. In this example, at block 680, the previous frame's seam is retrieved, and, at block 685, is used as a baseline to determine whether the previous frame's seam crosses an object in the current frame (e.g., using the seam scanning techniques described herein). If the previous frame's seam does not cross an object in the current frame, at block 690, the previous frame's seam is selected for the current frame. If the previous frame's seam does cross an object in the current frame, at block 695, distant objects may be removed from the mask projection being evaluated (e.g., detected objects that are farther than some distance such as 15 meters, detected objects that are less than some length in pixels such as 500 pixels), a vertical or horizontal seam is selected when the entire mask projection being evaluated (the overlapping region) is occupied by a detected object (to avoid the object), and/or a seam is selected that crosses the least number of (remaining) object pixels (e.g., using proximity to the previous frame's seam as a tie-breaker).
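
A simplified, non-authoritative sketch of this gradual-move logic might look as follows; the data layout (pixel-index seams, a per-pixel object distance array) and the tie-breaking heuristic are assumptions made for illustration only.

    import numpy as np

    def gradual_move_seam(prev_seam, candidate_seams, object_mask, object_distances,
                          max_range_m=15.0):
        # Each seam is an (rows, cols) pixel-index pair into `object_mask`;
        # `object_distances` holds a per-pixel distance (meters) to the detected
        # object occupying that pixel (inf where there is no object).
        def crossings(seam, mask):
            rows, cols = seam
            return int(mask[rows, cols].sum())

        # Block 690: keep the previous frame's seam if it still avoids objects.
        if crossings(prev_seam, object_mask) == 0:
            return prev_seam

        # Block 695: ignore detections beyond the range threshold, then pick the
        # candidate crossing the fewest remaining object pixels; ties go to the
        # candidate closest to the previous seam (compared here by mean column).
        near_mask = object_mask.astype(bool) & (object_distances <= max_range_m)

        def tie_break(seam):
            return abs(float(np.mean(seam[1])) - float(np.mean(prev_seam[1])))

        return min(candidate_seams, key=lambda s: (crossings(s, near_mask), tie_break(s)))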

As such, and returning to the example illustrated in FIGS. 7A and 7B, seams 782 a and 782 b may be selected for overlapping mask projections 762 a and 762 b, seams 784 a and 784 b may be selected for overlapping mask projections 764 a and 764 b, seams 786 a and 786 b may be selected for overlapping mask projections 766 a and 766 b, and seams 788 a and 788 b may be selected for overlapping mask projections 768 a and 768 b. Accordingly, the input images 705 a-d may be aligned, stitched at the seams, and projected (or aligned, projected, and stitched at the seams) to create the stitched image 770 (in this example, a stitched top-down projection image).

FIGS. 8A and 8B are example images and surround view visualizations with and without dynamic seam placement, in accordance with some embodiments of the present disclosure. More specifically, in FIG. 8A, front image 801 and right image 802 show fields of view in front and to the right of an ego-vehicle, respectively. Circles 805 and 806 identify where the same object (a nearby vehicle) appears in the front image 801 and the right image 802, respectively. In FIG. 8B, the left column shows two stitched images 810, 830 in which the front image 801 and the right image 802 (as well as rear and left images, not depicted) have been stitched together, and the right column shows corresponding visualizations 850, 860 of the seam locations in the stitched images 810, 830. The top row illustrates a scenario in which a fixed diagonal seam (e.g., seam 855) is used, and the bottom row illustrates a scenario in which dynamic seam placement is used to steer a seam to avoid detected or salient objects. The vehicle shown in circles 805 and 806 of FIG. 8A is shown in circle 820 in the enlarged view 815 of the top right portion of the stitched image 810 and in circle 840 in the enlarged view 835 of the top right portion of the stitched image 830. In the fixed diagonal seam scenario, since the seam 855 cuts through the vehicle represented in circle 820, ghosting has occurred. By contrast, in the dynamic seam placement scenario, seam 865 was selected to avoid crossing the vehicle, and as a result, there is no ghosting in the circle 840.

FIG. 9 shows example surround view visualizations with and without dynamic seam placement, in accordance with some embodiments of the present disclosure. In FIG. 9, the top row shows surround view visualizations 910, 920 that use fixed seam placement, and the bottom row shows corresponding surround view visualizations 930, 940 that use dynamic seam placement. In surround view visualizations 910, a fixed seam placement results in ghosting in circle 915, whereas in surround view visualizations 930, dynamic seam placement does not result in ghosting in circle 935. Similarly, in surround view visualizations 920, a fixed seam placement causes ghosting in circle 925, whereas in surround view visualizations 940, dynamic seam placement does not result in ghosting in circle 945.

Now referring to FIGS. 10, 11, and 12, each block of methods 1000, 1100, and 1200, described herein, comprises a computing process that may be performed using any combination of hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory. The methods may also be embodied as computer-usable instructions stored on computer storage media. The methods may be provided by a standalone application, a service or hosted service (standalone or in combination with another hosted service), or a plug-in to another product, to name a few. In addition, methods 1000, 1100, and 1200 may be understood, by way of example, with respect to the stitching module 120 of FIG. 1, the stitching module 304 of FIG. 3, or the dynamic seam stitching module 500 of FIG. 5. However, these methods may additionally or alternatively be executed by any one system, or any combination of systems, including, but not limited to, those described herein.

FIG. 10 is a flow diagram showing a method 1000 for determining a position for a seam using projected masks, in accordance with some embodiments of the present disclosure. The method 1000, at block B1002, includes generating two or more aligned image frames captured during a first time slice and representative of two or more overlapping viewpoints around an ego-object in an environment. For example, with respect to FIG. 5, the alignment component 510 may access frames of image data (e.g., images) representing overlapping views of an environment, and use corresponding camera parameters to align the frames and create an aligned composite image (e.g., a panorama, a 360° image, a top-down projection image) with regions of overlapping image data.

The method 1000, at block B1004, includes generating two or more projected masks comprising overlapping representations of detected objects depicted in the two or more aligned image frames. For example, with respect to FIG. 5, the projection component 550 may project two or more object and/or saliency masks onto an aligned composite image to overlay each overlapping region of image data with two (or more) different object and/or saliency masks representing objects or other salient regions detected by multiple sensors.

The method 1000, at block B1006, includes determining a candidate position for a seam in an overlapping region of the two or more aligned image frames. For example, with respect to FIG. 5, the seam steering component 560 may determine a seam placement for each overlapping region to avoid (or attempt to avoid) placing a seam on salient regions represented in two or more overlaid (projected) masks. As such, and with respect to FIG. 7B, seam steering 760 may be performed (e.g., by seam steering component 560 of FIG. 5) to determine a seam placement for each overlapping region (e.g., seams 782 a and 782 b, 784 a and 784 b, 786 a and 786 b, 788 a and 788 b of FIG. 7B). In an example with a straight seam, starting at the corners 770 a, 770 b, a ray may be projected out at an angle 775, a seam space may be scanned over different values of the angle 775 to cover the overlapping mask projections 762 a and 762 b (e.g., from 0-90°), and one or more corresponding candidate seams (e.g., seam 782) may be traversed through the pixels of the region and evaluated. In some embodiments, the seam update state machine 651 of FIG. 6 may retrieve a previous frame's seam and use the previous frame's seam as a candidate seam for the current frame. Generally, seams of any shape may be used, whether straight, curved, segmented, or otherwise.

The method 1000, at block B1008, includes updating the candidate position for the seam to an updated position based at least on an intersection of the seam at the candidate position with pixels of the detected objects in one or more of the two or more projected masks. For example, with respect to FIG. 7B, seam steering 760 may be performed (e.g., by seam steering component 560 of FIG. 5) by evaluating each of a plurality of candidate seams to determine whether and/or to what extent each candidate seam intersects object or salient pixels (e.g., a binary indication, a count of intersected pixels, a sum of intersected pixels weighted by saliency), and an updated seam position that avoids (or minimizes) crossing (salient) objects or regions represented in each of the overlapping mask projections 762 a-768 b may be determined.

In some embodiments, the seam update state machine 651 of FIG. 6 may determine whether a previous frame's seam crosses an object in the current frame (e.g., using the seam scanning techniques described herein). If the previous frame's seam does cross an object in the current frame, at block 695, distant objects may be removed from the mask projection being evaluated (e.g., detected objects that are farther than some distance such as 15 meters, detected objects that are less than some length in pixels such as 500 pixels), a vertical or horizontal seam is selected when the entire mask projection being evaluated (the overlapping region) is occupied by a detected object (to avoid the object), and/or a seam is selected that crosses the least number of (remaining) object pixels (e.g., using proximity to the previous frame's seam as a tie-breaker).

The method 1000, at block B1010, includes generating a visualization composite image frame based at least on stitching the two or more aligned image frames using the updated position of the seam. For example, with respect to FIG. 5, the blending component 590 may use any suitable blending technique to blend overlapping image data at each seam and create a stitched image (e.g., a panorama, 360° image, projection image).

FIG. 11 is a flow diagram showing a method 1100 for determining a position for a seam based at least on projected overlapping representations of one or more salient regions, in accordance with some embodiments of the present disclosure. The method 1100, at block B1102, includes generating two or more saliency masks representing one or more salient regions depicted in corresponding image frames captured using two or more cameras of an ego-object in an environment during a first time slice. For example, the object detector 540 may perform or access results of object detection and generate one or more object and/or saliency masks for each image frame. In some embodiments, and with respect to FIG. 7A, object detection 740 may generate a binary object mask representing whether or not each point (e.g., each pixel in a 2D mask corresponding to each image) corresponds to a depicted part of a detected (salient) object represented by each input image 705 a-d, and the binary object masks for the input images 705 a-d may be used as the masks 745 a-d. In some embodiments, a binary object mask may be used to derive a weighted saliency mask representing a measure of saliency of each point (e.g., by weighting the binary object mask based on proximity of a corresponding detected object, prioritizing certain classifications of detected object, or using logic that depends on the driving scenario).
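
For instance, a weighted saliency mask could be derived from a binary object mask along the lines of the following sketch, where the proximity weighting and the class-priority table are illustrative assumptions rather than required choices.

    import numpy as np

    def weighted_saliency_mask(object_mask, distance_m, class_id, class_weights=None):
        # Saliency grows as objects get closer and is scaled by a per-class priority.
        # `object_mask` is binary, `distance_m` and `class_id` are per-pixel arrays.
        if class_weights is None:
            class_weights = {0: 1.0, 1: 2.0}  # e.g., 0 = vehicle, 1 = pedestrian (assumed ids)
        proximity = 1.0 / np.maximum(distance_m, 1.0)      # closer objects weigh more
        priority = np.vectorize(lambda c: class_weights.get(int(c), 1.0))(class_id)
        return object_mask.astype(float) * proximity * priority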

The method 1100, at block B1104, includes generating an aligned composite representation that aligns the corresponding image frames and includes a region of overlapping image data. For example, with respect to FIG. 5, the alignment component 510 may access frames of image data (e.g., images) representing overlapping views of an environment, and use corresponding camera parameters to align the frames and create an aligned composite image (e.g., a panorama, a 360° image, a top-down projection image) with regions of overlapping image data.

The method 1100, at block B1106, includes generating projected overlapping representations of the one or more salient regions based at least on projecting the two or more saliency masks onto a portion of the aligned composite representation corresponding to the region of overlapping image data. For example, with respect to FIG. 5, the projection component 550 may project two or more object and/or saliency masks onto an aligned composite image to overlay each overlapping region of image data with two (or more) different object and/or saliency masks representing objects or other salient regions detected by multiple sensors.

The method 1100, at block B1108, includes determining a seam location in the region of overlapping image data based at least on the projected overlapping representations of the one or more salient regions. For example, with respect to FIG. 5, the seam steering component 560 may determine a seam placement for each overlapping region to avoid (or attempt to avoid) placing a seam on salient regions represented in two or more overlaid (projected) masks.

The method 1100, at block B1110, includes generating a stitched representation of the environment based at least on stitching the image data in the overlapping region at the seam location. For example, with respect to FIG. 5, the blending component 590 may use any suitable blending technique to blend overlapping image data at each seam and create a stitched image (e.g., a panorama, 360° image, projection image).

FIG. 12 is a flow diagram showing a method 1200 for determining whether to use a default placement or a dynamic placement for a seam, in accordance with some embodiments of the present disclosure. The method 1200, at block B1202, includes generating two or more aligned image frames captured during a first time slice and representative of two or more overlapping viewpoints around an ego-object in an environment. For example, with respect to FIG. 5, the alignment component 510 may access frames of image data (e.g., images) representing overlapping views of an environment, and use corresponding camera parameters to align the frames and create an aligned composite image (e.g., a panorama, a 360° image, a top-down projection image) with regions of overlapping image data.

The method 1200, at block B1204, includes determining, based at least on an active viewport or a state of the ego-object, whether to use a default placement for a seam, or a dynamic placement that attempts to avoid placing the seam in a salient region. For example, with respect to FIG. 6, the dynamic seam toggling state machine 601 may determine whether to use a default seam placement or a dynamic one, and enable and disable dynamic seam placement based on speed of ego-motion of an ego-vehicle, direction of ego-motion, proximity to salient objects, active viewport direction, driver gaze, and/or other factors. For example, for each time slice and a set of frames being stitched for that time slice, the dynamic seam placement state machine 600 may determine where to place a seam in each region of overlapping image data in an aligned composite image generated from the set of frames.
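
By way of a non-limiting sketch only, a toggling policy in the spirit of block B1204 might combine ego-speed, object proximity, and the active viewport as follows; the thresholds and the viewport check are assumptions for illustration, not the conditions of the state machine 601.

    def use_dynamic_seams(ego_speed_kmh, nearest_object_m, active_viewport,
                          speed_limit_kmh=16.0, proximity_limit_m=7.0):
        # Enable dynamic seam placement at low speed, when a salient object is
        # nearby, and when the active viewport is one that shows stitched content.
        if ego_speed_kmh >= speed_limit_kmh:
            return False                 # fast ego-motion: fall back to default seams
        if nearest_object_m is None or nearest_object_m > proximity_limit_m:
            return False                 # nothing salient nearby worth steering around
        return active_viewport in ("front", "rear", "left", "right", "top")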

The method 1200, at block B1206, includes generating a visualization composite image frame based at least on stitching image data of the two or more aligned image frames using the default placement or the dynamic placement of the seam. For example, with respect to FIG. 5, the blending component 590 may use any suitable blending technique to blend overlapping image data at each seam and create a stitched image (e.g., a panorama, 360° image, projection image).

Adaptive 3D Bowl Model of the Surrounding Environment

One or more embodiments are directed to techniques for visualizing an environment surrounding an ego-object (e.g., a vehicle, such as the example autonomous vehicle 3500 of FIGS. 35A-35D) using an adaptive 3D bowl that models the surrounding environment and has a 3D shape that changes based on the distance to nearby detected objects. As such, image data representing the surrounding environment may be projected onto an adaptive 3D bowl to generate a textured 3D bowl that reduces visual distortion over prior techniques, and a view of the textured 3D bowl may be rendered from the perspective of a virtual camera.

By way of background, FIG. 13 illustrates an example data flow through an example surround view visualization system 1300 that uses a 3D bowl 1320, in accordance with some embodiments of the present disclosure. In FIG. 13, the input images 1310 are four fisheye images captured by fisheye cameras located at the front, left, rear and right sides of a vehicle body. In an example embodiment, the stitching module 1325 and the projection module 1335 process the input images 1310 to generate a top-down view 1330 of the input images. For example, the input images 1310 may be aligned by the stitching module 1325, stitched by the stitching module 1325, and projected by the projection module 1335 into the top-down view 1330 (or aligned, projected, and stitched). In some embodiments, to perform a top-down projection, the projection module 1335 back-projects 2D points in pixel coordinates to 3D points using corresponding depth values obtained from a 3D model of the surrounding environment (e.g., the 3D bowl 1320), from a depth map or range image (e.g., captured using LiDAR or RADAR sensor(s); generated by a deep neural network from image data, LiDAR or RADAR detections, and/or other sensor data), and/or otherwise. With the image data assigned to corresponding 3D points, the projection module 1335 may orthographically project the 3D points, for example, into the top-down view 1330. In some embodiments, a UV mapping module 1345 maps the top-down view 1330 onto the 3D bowl 1320, texturizing the 3D bowl 1320 with image data, to generate a textured 3D bowl 1340 (e.g., a UV mapped mesh, using forward or reverse mapping), and a viewport renderer 1355 may position and orient a virtual camera in a 3D scene with the textured 3D bowl 1340 and use the virtual camera to render a view (e.g., image 1350) of the textured 3D bowl 1340 for a corresponding viewport.
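
As a minimal illustration of the orthographic projection step, colored 3D points (e.g., obtained by back-projecting pixels using depths from the 3D bowl 1320) could be splatted into a top-down image as sketched below; the coordinate conventions, grid size, and metric extent are assumptions made for the example.

    import numpy as np

    def orthographic_topdown(points_xyz, colors, grid_size=512, extent_m=20.0):
        # Project colored 3D points (N x 3, ego-centered, meters) straight down into
        # a top-down image covering +/- extent_m around the ego-object.
        img = np.zeros((grid_size, grid_size, 3), dtype=np.uint8)
        scale = grid_size / (2.0 * extent_m)
        cols = ((points_xyz[:, 0] + extent_m) * scale).astype(int)   # x -> image column
        rows = ((extent_m - points_xyz[:, 1]) * scale).astype(int)   # y -> image row (flipped)
        valid = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)
        img[rows[valid], cols[valid]] = colors[valid]
        return img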

Note that the surround view visualization system 1300 is meant simply as an example, and other techniques may be used to render a surround view visualization in a desired viewport. For example, given a desired viewport (e.g., with designated position, orientation, size, shape), a view of a 3D representation of the surrounding environment through the viewport may be rendered directly from captured image data using a reverse projection to determine which pixel in the captured image data each pixel of the viewport rendering maps to (e.g., without generating the top-down view 1330 or the textured 3D bowl 1340).

In embodiments that orthographically project image data into a 2D projection image (e.g., a top-down view) using depth values retrieved from a 3D bowl (e.g., the 3D bowl 1320), and/or embodiments that project image data onto a 3D bowl (e.g., using UV mapping), the shape of the 3D bowl may impact the appearance of the projected image data. FIG. 14 shows example surround view visualizations of textured circular and elliptical 3D bowls, in accordance with some embodiments of the present disclosure. In FIG. 14, surround view visualization 1410 is a rendering of a circular 3D bowl textured with image data from the environment surrounding the black car, and surround view visualization 1440 is a rendering of an elliptical 3D bowl textured with image data from the environment surrounding the black car. The circular 3D bowl used to render the surround view visualization 1410 is represented to the left by circle 1420, which corresponds to a top-down view of the inner bowl of the circular 3D bowl. The circle 1420 illustrates an example position and dimensions of the inner bowl with respect to an example vehicle represented by top-down vehicle outline 1430. The elliptical 3D bowl used to render the surround view visualization 1440 is represented by ellipse 1450, which corresponds to a top-down view of the inner bowl of the elliptical 3D bowl. The ellipse 1450 illustrates an example position and dimensions of the inner bowl with respect to an example vehicle represented by top-down vehicle outline 1460. Notice how objects in front of the vehicle are depicted in the surround view visualizations 1410 and 1440 with different sizes, due to the underlying image data being projected different distances in that direction between the vehicle and the sides of the circular and elliptical 3D bowls. For example, the bushes in front of the vehicle are closer to the vehicle than the front side of the elliptical 3D bowl, so projecting an image of the bushes onto the front side of the elliptical 3D bowl creates a scale magnification.

More generally, FIG. 15 shows some example artifacts that can arise in surround view visualizations that use a 3D bowl. In a scale magnification scenario, the front car 1510 has a rear-mounted camera with a field of view 1515 that includes the rear car 1520. Using a 3D bowl with a side 1525, the rear car 1520 is inside the side 1525 of the 3D bowl. As a result, projecting an image of the rear car 1520 captured by the rear-mounted camera onto the side 1525 of the 3D bowl will create a scale magnification, since the assumed depth (distance to the side 1525 of the 3D bowl) is greater than the actual depth (distance to the rear car 1520).

In a ghosting scenario, multiple cameras mounted on the car 1530 pick up an object 1540 that is farther away than the side of the 3D bowl 1550 (e.g., outside the 3D bowl 1550). In this case, projecting multiple images of the object 1540 onto the side of the 3D bowl 1550 leads to ghosting or duplication, as the projected object 1560 appears twice on the side of the 3D bowl 1550.

In an object disappearance scenario, an object 1570 that is inside the 3D bowl 1580 is partially or completely lost during the projection process, leading to partial or complete object disappearance.

To reduce visual artifacts such as these, some embodiments that use a 3D bowl that models the surrounding environment (e.g., a 3D bowl mesh) adapt the shape of the 3D bowl based on distance to detected objects. FIG. 16 illustrates an example adaptive 3D bowl generator 1600 that adapts a 3D bowl based on distance to detected objects, in accordance with some embodiments of the present disclosure. The adaptive 3D bowl generator 1600 may correspond to the adaptive 3D bowl generator 150 of FIG. 1 (e.g., the 3D object detector 1620 and the radial distance mapper 1630 may correspond to the depth estimator 155 of FIG. 1, and the 3D bowl parameter controller 1650 may correspond to the 3D bowl adapter 160 of FIG. 1). At a high level, one or more sensors (e.g., cameras) of an ego-object (e.g., a vehicle) may be used to capture input images 1610 and/or other sensor data. A 3D object detector 1620 may perform 3D object detection on the sensor data (e.g., the input images 1610, corresponding RADAR or LiDAR data) representing the surrounding environment. In some embodiments, sensor data such as a LiDAR point cloud may be projected onto a top-down 2D occupancy grid that represents locations of detected objects. A radial distance mapper 1630 may compute distance(s) to the closest detected object as a function of angle (e.g., representing a rotation around an axis of the coordinate system of the ego-object, such as yaw). A 3D bowl parameter controller 1650 may use the distances and directions to adapt the shape of a 3D bowl (e.g., a mesh or other geometric model), for example, by (re)sizing the ground plane of the 3D bowl to fit within the distance to the closest detected object, or by modifying the shape, size (e.g., along one or more dimensions or axes), or orientation (e.g., rotation relative to an ego-vehicle) of the 3D bowl to place the closest detected object(s) between the ground plane of the bowl and the outer edge or rim of the bowl. As such, the adaptive 3D bowl generator 1600 may generate a 3D bowl by identifying representative parameters for the 3D bowl that adapt based on distance and/or direction to detected objects.

As such, the adapted 3D bowl may be used to render a surround view visualization of the input images 1610. In some embodiments, the input images 1610 are aligned, stitched, and orthographically projected (or aligned, orthographically projected, and stitched) using corresponding depth values from the adapted 3D bowl to generate a projection image (e.g., corresponding to output image 1670). In some embodiments, the projection image is mapped onto the adapted 3D bowl, or the input images 1610 are otherwise aligned, stitched, and projected onto the adapted 3D bowl (or aligned, projected, and stitched), to create a textured 3D bowl surface. In some embodiments, a view of the textured 3D bowl surface is rendered, and the rendered view may be presented on a monitor or other display device visible to an occupant or operator (e.g., driver) of the ego-object.

Generally, distances and directions to surrounding objects may be obtained or determined in various ways. In the example illustrated in FIG. 16, the 3D object detector 1620 performs any known technique to identify 3D points that belong to surrounding objects, and the radial distance mapper 1630 calculates distances and directions to the 3D points. The 3D object detector 1620 may include or trigger one or more machine learning models (e.g., neural networks) to predict 3D cuboids from the input images 1610, corresponding LiDAR or RADAR detections, and/or other sensor data, and distances and directions between the ego-object (e.g., vehicle center) and detected objects (e.g., closest point, 3D cuboid corner(s), center) may be computed from the predicted 3D cuboids. In some embodiments, the 3D object detector 1620 uses one or more machine learning models (e.g., neural networks) to predict (e.g., top-down) 2D bounding boxes, for example, from an orthographic projection of image data (e.g., fisheye images), LiDAR or RADAR detections, and/or other sensor data, and distances and directions between the ego-object and detected objects (e.g., closest point, bounding box corner(s), center) may be computed from the predicted 2D bounding boxes. FIG. 16 illustrates example detected objects 1625 surrounding an ego-object 1627, where the detected objects 1625 may correspond to predicted 2D top-down bounding boxes and/or top-down projections of predicted 3D bounding boxes or cuboids.

In another example technique for determining distances to surrounding objects, sensor data such as a LiDAR or RADAR point cloud representing detected objects may be projected into a 2D top-down occupancy grid, and distances to the closest LiDAR or RADAR point in the 2D top-down occupancy grid may be computed (e.g., by the depth estimator 155 of FIG. 1). FIG. 17A shows an example 2D top-down occupancy grid 1710 where the ego-object is represented in the middle of the 2D top-down occupancy grid 1710, occupied cells in the grid are illustrated as white, and the distances and directions to surrounding objects are represented with grey lines extending radially outward from the ego-object. In some embodiments, some other ranging technique may be additionally or alternatively applied, for example, using depth or stereo camera arrays, sensor data, and/or a machine learning model (e.g., a neural network) to generate or predict depth values, a depth map, ranges, or a range image (e.g., a LiDAR range image); and distances to surrounding objects may be determined from the corresponding depth or range values.
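
A non-limiting sketch of computing the distance to the closest occupied cell per angular bin from such a grid follows; the grid resolution, bin count, and centering of the ego-object on the grid are assumptions made for the example.

    import numpy as np

    def radial_distance_map(occupancy, cell_size_m=0.1, num_bins=360):
        # For each angular bin around the grid center (ego-object), record the
        # distance (meters) to the closest occupied cell; inf where nothing is detected.
        h, w = occupancy.shape
        cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
        rows, cols = np.nonzero(occupancy)
        dy, dx = rows - cy, cols - cx
        dist = np.hypot(dx, dy) * cell_size_m
        angle = np.degrees(np.arctan2(dy, dx)) % 360.0
        bins = (angle / (360.0 / num_bins)).astype(int) % num_bins
        radial = np.full(num_bins, np.inf)
        np.minimum.at(radial, bins, dist)   # keep the closest detection per bin
        return radial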

Regardless of how the surrounding detected objects are represented (e.g., detected 3D objects, occupied cells in a 2D occupancy grid), in some embodiments, detected objects beyond some threshold distance to the ego-object (e.g., 3 m, 7 m, 15 m, etc.) may be filtered out or ignored to free up computational resources and/or reduce distortion resulting from a wider range of detected objects. In some embodiments, the range threshold used during range filtering may be variable, for example, based on the distance to the closest detected objects. An example implementation might start with an initial range threshold (e.g., 5 m), and expand to a farther range threshold (e.g., 7 m) based on no detected objects being within the initial range threshold, or reduce to a closer range threshold (e.g., 3 m) based on detected objects being within the initial range. As such, some embodiments may serve to identify one or more closest detected objects within a variable range threshold. In some cases, detected objects outside the applicable range threshold may be filtered out or otherwise ignored from consideration.
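
A variable range threshold of the kind described could be chosen, for example, along the following lines; this sketch uses the example values from above, and the specific selection rule is an assumption.

    def variable_range_threshold(distances_m, initial_m=5.0, expanded_m=7.0, reduced_m=3.0):
        # Expand the threshold when nothing is detected within the initial range,
        # and tighten it when detections are already close to the ego-object.
        if len(distances_m) == 0:
            return expanded_m
        nearest = min(distances_m)
        if nearest > initial_m:
            return expanded_m     # nothing nearby: look farther out
        if nearest <= reduced_m:
            return reduced_m      # close objects: focus on the near field
        return initial_m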

As such, the radial distance mapper 1630 may compute distances and directions to the (remaining) object detections 1625 using any of the techniques described herein, and a representation of the distances and directions may be generated. In some embodiments, the distances and directions (illustrated by visualization 1635) are represented in a radial distance map, a list of entries, or some other structure that represents distance to the closest detected object and direction (e.g., distance as a function of an angle, such as yaw). For example, the representation of the distances and directions may take the form of a 2D array with a radial component (e.g., identifying an angular increment, such as 1 degree) and one or more distance components (e.g., representing the distance to the closest detected object in the direction represented by a corresponding angular increment, or representing the distance to a particular corner or center of the closest detected object in that direction). In FIG. 16, the visualization 1635 illustrates one or more values that may be stored in an example radial distance map. The visualization 1635 depicts an ego-object 1640 (corresponding to the ego-object 1627), bounding boxes of detected objects 1645 (corresponding to a set of the detected objects 1625 within a threshold range of the ego-object 1627), and distances between the ego-object 1640 and the corners of the bounding boxes of the detected objects 1645.

Based on the distances and directions between the ego-object 1640 and the detected objects 1645, the 3D bowl parameter controller 1650 may adapt the shape of a 3D bowl modeling the surrounding environment. In terms of bowl shape, the 3D bowl (e.g., a mesh) when viewed from top-down may be circular, elliptical, or some other shape (whether regular or irregular), and the shape may be symmetrical or asymmetrical. Depending on the embodiment, the distances and directions to the detected objects 1645 may be used to fit a (e.g., symmetrical) bowl shape, synthesize a shape using local deformations to accommodate different distances to different objects, or otherwise. For example, the 3D bowl parameter controller 1650 may use the distances and directions to adapt the shape of a 3D bowl, for example, by (re)sizing the ground plane of the 3D bowl to fit within the distance to the closest detected object, or by modifying the shape, size (e.g., along one or more dimensions or axes), and/or orientation (e.g., rotation relative to an ego-vehicle) of the 3D bowl to place (one or more points that belong to) the closest detected object(s) between the ground plane of the bowl and the outer edge or rim of the bowl.

The implementation illustrated in FIG. 16 involves an elliptical 3D bowl with an inner bowl 1655 and an outer bowl 1660 separated by some distance, whether constant (e.g., 3 m) or variable, and an axis of the ellipse (e.g., the short or long axis) may be aligned with an axis of the vehicle coordinate system (e.g., pointing to the front of the vehicle). In some embodiments, the 3D bowl parameter controller 1650 may size the inner bowl 1655 to fit within the distance to the closest detected object of the detected objects 1645. For example, the closest point of a detected object may be assumed to be at some point along the circumference of an ellipse that is aligned with and centered on the ego-object 1652 (e.g., which corresponds to the ego-object 1640 and the ego-object 1627), and the ellipse may be parameterized using the standard equation for an ellipse to calculate representative parameters for the inner bowl (e.g., values of the short and long axes, foci, eccentricity, circumference, etc.). In some embodiments, the 3D bowl parameter controller 1650 may filter and/or cluster points that belong to multiple detected objects (e.g., the closest point for each detected object, a center point for each detected object), and use the filtered and/or clustered points to fit an ellipse (or multiple ellipses), or some other shape, optionally rotated (e.g., to align a long or short axis with a particular detected object). For example, the 3D bowl parameter controller 1650 may identify a number of detected objects (e.g., with center points that fall) within a particular range (e.g., corresponding to a desired separation between the inner bowl 1655 and the outer bowl 1660, or a subset thereof), and may fit shapes for the inner bowl 1655 and the outer bowl 1660 so the identified detected objects (or identified points of the detected objects) fit between the inner bowl 1655 and the outer bowl 1660. The 3D bowl may be parameterized by parameters of the inner bowl 1655 and/or the outer bowl 1660 (e.g., short and long axes of the inner and outer bowls), so the 3D bowl parameter controller 1650 may effectively adapt the 3D bowl based on distance and direction to detected object(s) by determining a set of 3D bowl parameters for each time slice. These are just a few examples, and other ways of adapting the inner bowl 1655 and/or the outer bowl 1660 based on distance and direction to a detected object may be implemented (e.g., an example of which is discussed in greater detail with respect to FIG. 17B).
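
By way of illustration only, the standard ellipse equation can be solved so that the closest detected point lies on the inner bowl's circumference, assuming an axis-aligned ellipse centered on the ego-object and a fixed short-to-long axis ratio (both assumptions made for this sketch):

    import numpy as np

    def fit_inner_bowl_ellipse(closest_dist_m, closest_angle_deg, aspect_ratio=0.75):
        # Solve (x/a)^2 + (y/b)^2 = 1 with b = aspect_ratio * a for the semi-axis a,
        # so the closest detected point (at the given range and bearing) lies on the ellipse.
        theta = np.radians(closest_angle_deg)
        x, y = closest_dist_m * np.cos(theta), closest_dist_m * np.sin(theta)
        a = np.sqrt(x ** 2 + (y / aspect_ratio) ** 2)   # long (x) semi-axis
        b = aspect_ratio * a                            # short (y) semi-axis
        return a, b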

As such, the adapted 3D bowl may be used to render a surround view visualization (e.g., the output image 1670) of the input images 1610. By way of nonlimiting example, the input images 1610 are aligned, stitched, and orthographically projected (or aligned, orthographically projected, and stitched) using corresponding depth values from the adapted 3D bowl to generate a top-down projection image corresponding to output image 1670. In some embodiments, the projection image is mapped onto the adapted 3D bowl, or the input images 1610 are otherwise aligned, stitched, and projected onto the adapted 3D bowl (or aligned, projected, and stitched), to create a textured 3D bowl surface, a view of the textured 3D bowl surface is rendered, and/or the rendered view is presented on a monitor or other display device visible to an occupant or operator (e.g., driver) of the ego-object (e.g., corresponding to the ego-object 1627, 1640, and 1652).

Another example technique for using distance and direction to a detected object to adapt a 3D bowl uses the distance and direction to the closest detected object in each radial direction (e.g., stored in, represented by, or identified using a 2D top-down occupancy grid) as a corresponding dimension of the 3D bowl (e.g., the radius of the inner bowl or the outer bowl in a corresponding direction). For example, the distance to the closest detected object in each radial direction may be used as the radius (or may be used to fit a shape) for the inner bowl, and the outer bowl may be set to some fixed or dynamic separation from the inner bowl (e.g., 3 m) (or distances to the closest detected object may be used to determine a shape for the outer bowl, and the inner bowl may be separated from the outer bowl by some fixed or dynamic distance). To smooth out the bowl shape, a spatial filter may be applied over some angular range (e.g., 20-60°).

FIG. 17A is an example 2D top-down occupancy grid 1710, and FIG. 17B illustrates an example adaptive 3D bowl generation technique 1700 that adapts a 3D bowl using the 2D top-down occupancy grid 1710, in accordance with some embodiments of the present disclosure. Generally, dense sensor data (e.g., a LiDAR point cloud) captured by one or more sensors of an ego-object may be projected to a top-down view, and a grid of cells (with a designated ground resolution) may be populated with an (e.g., binary) indication of whether or not a sensor detection (e.g., a LiDAR point) landed on the cell during the projection. FIG. 17A shows an example 2D top-down occupancy grid 1710 where the ego-object (e.g., a vehicle) is represented in the middle of the 2D top-down occupancy grid 1710, occupied cells in the grid (corresponding to detected objects) are illustrated as white, and the distances and directions to surrounding detected objects are represented with grey lines extending radially outward from the ego-object. FIG. 17B illustrates an example stitched image 1760 of the scene represented by the 2D top-down occupancy grid 1710, where the example stitched image 1760 uses a fixed 3D bowl (e.g., the initial 3D bowl 1720). The correspondences between three vehicles represented in the 2D top-down occupancy grid 1710 and the stitched image 1760 are illustrated with grey arrows. Note how the vehicle 1765 is distorted due to a scale magnification.

By contrast, in some embodiments, an initial 3D bowl 1720 (e.g., a circular 3D bowl) is deformed using the 2D top-down occupancy grid 1710 to generate a deformed 3D bowl 1730. More specifically, for each angular increment (e.g., 1°), the distance to the closest detected object in that direction is determined using the 2D top-down occupancy grid 1710 (e.g., by computing and retrieving a corresponding value from a radial distance map that stores distance as a function of angle), and a radius is set for the deformed 3D bowl 1730 in a corresponding direction to equal the distance to the closest detected object in that direction.
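
A minimal sketch of this deformation, assuming a radial distance map indexed by angular bin (with inf where no object was detected in that direction), is:

    import numpy as np

    def deform_bowl_radii(initial_radius_m, radial_distance_map_m):
        # Per angular bin, use the distance to the closest detected object as the
        # bowl radius; directions with no detections keep the initial circular radius.
        radii = np.asarray(radial_distance_map_m, dtype=float).copy()
        radii[~np.isfinite(radii)] = initial_radius_m
        return radii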

In some embodiments, a circle or ellipse is fitted to the radii of the deformed 3D bowl 1730 to generate a filtered 3D bowl 1740. Additionally or alternatively, the radii of the deformed 3D bowl 1730 (or a corresponding fitted shape) are filtered using one or more spatial and/or temporal filters to generate a filtered 3D bowl 1740 that has smoothed radial values that change gradually as a function of angle (e.g., forming an elliptical 3D bowl). Generally, any number of spatial and/or temporal filters may be applied to parameters (or a combination of parameters) representing the 3D bowl (e.g., radius). For example, a temporal filter may be applied over a temporal window (e.g., 30 frames of data buffered first-in-first-out) to a ratio of parameters (e.g., the ratio of the short to long axes of an elliptical 3D bowl) to stabilize the ratio, to individual parameters (e.g., short and long axes of an inner bowl), and/or otherwise. In some embodiments, temporal filtering is applied to a ratio before being applied to individual parameters in order to apply coarse filtering before finer smoothing. In some implementations, stochastic and/or Kalman filtering may be applied over a temporal window to smooth out transitions. These are just a few examples, and other filters may additionally or alternatively be applied.
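
For instance, temporal smoothing over a first-in-first-out window could be sketched as follows, where filtering the axis ratio before the individual axes mirrors the coarse-then-fine ordering described above; the class name, window length, and use of a plain moving average are illustrative assumptions.

    from collections import deque
    import numpy as np

    class BowlParamSmoother:
        # Smooth elliptical bowl parameters over a FIFO window (e.g., 30 frames):
        # average the short/long-axis ratio first, then the long axis, and recompute
        # the short axis from the smoothed ratio.
        def __init__(self, window=30):
            self.ratios = deque(maxlen=window)
            self.long_axes = deque(maxlen=window)

        def update(self, long_axis_m, short_axis_m):
            self.ratios.append(short_axis_m / long_axis_m)
            self.long_axes.append(long_axis_m)
            smoothed_ratio = float(np.mean(self.ratios))
            smoothed_long = float(np.mean(self.long_axes))
            return smoothed_long, smoothed_ratio * smoothed_long   # (long, short)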

FIG. 17B illustrates an example stitched image 1750 of the scene represented by the 2D top-down occupancy grid 1710, where the example stitched image 1750 uses the filtered 3D bowl 1740. Note how the vehicle 1755 is less distorted in the stitched image 1750, reducing the scale magnification present in the stitched image 1760.

In some embodiments, a stitched image or surround view visualization is generated from images captured during each time slice, and an adaptive 3D bowl is generated for each time slice. In some embodiments, a determination is made whether to use an adaptive 3D bowl during each time slice. For example, a state machine implementing a decision tree may be used to determine whether to use a fixed 3D bowl or an adaptive 3D bowl that adapts its shape based on distance to detected object(s). For example, an adaptive 3D bowl may be selectively enabled in scenarios less likely to involve passing objects, for example, based on speed of ego-motion. This may be thought of as applying a speed filter to enable or disable adaptive 3D bowl (e.g., choosing between the filtered 3D bowl 1740 and the initial 3D bowl 1720, or between the filtered 3D bowl 1740 and a 3D bowl from a preceding time slice) based on speed of the ego-object. In an example implementation, adaptive 3D bowl may be disabled at speeds at or above a designated threshold speed (e.g., 16 km/hr) and enabled at speeds below the designated threshold.

FIG. 18 is an example adaptive 3D bowl state machine 1800, in accordance with some embodiments of the present disclosure. In this example, at block 1810, ego-speed is determined (e.g., from a speed sensor). If ego-speed is below a lower speed threshold (e.g., less than 1 km/hr), at block 1820, the adaptive 3D bowl state machine 1800 determines whether the ego-object is starting a journey (e.g., based on ego-speed increasing from 0 km/hr and not exceeding the lower speed threshold). If ego-speed is below the lower speed threshold and the ego-vehicle is starting on a journey, at block 1830, the adaptive 3D bowl state machine 1800 enables adaptive 3D bowl. If ego-speed is below the lower speed threshold and the ego-vehicle is not starting on a journey (e.g., ego-speed decreased from above the lower speed threshold), at block 1840, the adaptive 3D bowl state machine 1800 disables adaptive 3D bowl. Example use cases for this scenario include the ego-object slowing down and stopping at a road intersection, stop sign, or traffic light; cross-traffic scenarios (e.g., another ego-object is approaching from the left or right and crossing the ego-object's path, such as when the ego-object is backing out of a parking space); maneuvers at low speed; and/or others. If ego-speed is above the lower speed threshold and below a higher speed threshold (e.g., 16 km/hr), at block 1850, the adaptive 3D bowl state machine 1800 enables adaptive 3D bowl. Example use cases for this scenario include parking search and other driving maneuvers.
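
A compact, non-authoritative sketch of this decision logic, using the example thresholds from the text and approximating "starting a journey" as ego-speed increasing from near zero (an assumption about how that state might be detected), is:

    def adaptive_bowl_enabled(ego_speed_kmh, prev_speed_kmh,
                              lower_kmh=1.0, upper_kmh=16.0):
        # Below the lower threshold: enable only when the ego-object appears to be
        # starting a journey (speed rising from near zero) rather than coming to a stop.
        if ego_speed_kmh < lower_kmh:
            starting_journey = prev_speed_kmh <= ego_speed_kmh and prev_speed_kmh < lower_kmh
            return starting_journey       # block 1830 (enable) vs. block 1840 (disable)
        if ego_speed_kmh < upper_kmh:
            return True                   # block 1850: enable during low-speed maneuvers
        return False                      # at/above the higher threshold: disable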

FIG. 19 is a diagram illustrating an example data flow 1900 through a Surround View System that includes adaptive 3D bowl and dynamic seam placement state machines, in accordance with some embodiments of the present disclosure. At a high level, one or more sensors (e.g., cameras) of an ego-object (e.g., a vehicle) may be used to capture input images 1910 and/or other sensor data during each time slice. For each time slice, an adaptive 3D bowl state machine 1920 (e.g., which may correspond to the adaptive 3D bowl state machine 1800 of FIG. 18) may determine whether to enable or disable adaptive 3D bowl (e.g., based on ego-speed 1933). When disabled, the adaptive 3D bowl state machine 1920 may lock an applicable 3D bowl to a fixed bowl or the previous bowl shape from the preceding time slice. When enabled, the adaptive 3D bowl state machine 1920 may trigger generation (e.g., by the adaptive 3D bowl generator 1600 of FIG. 16, using the adaptive 3D bowl generation technique 1700 of FIG. 17B) of an adaptive 3D bowl 1930 (e.g., based on distance(s) to detected object(s) 1934).

In some embodiments, the input images 1910 may be aligned (e.g., by the alignment component 510 of FIG. 5), creating one or more overlapping regions. A dynamic seam placement state machine 1940 (e.g., which may correspond to the dynamic seam placement state machine 600 of FIG. 6, the dynamic seam placement technique 700 of FIG. 7, and/or some portion thereof) may determine, for each time slice, whether to use a default seam placement or dynamic seam placement (e.g., based on ego-speed 1933, distance(s) to detected object(s) 1934, direction of ego-motion, and/or corresponding active viewport direction 1935), and a corresponding seam placement may be determined (e.g., by the seam placement component 520 of FIG. 5) for each overlapping region of the input images 1910. As such, the input images 1910 may be stitched using the determined seam placements for each overlapping region (e.g., by the blending component 590 of FIG. 5).

In some embodiments, the stitched image may be orthographically projected (e.g., to a top-down view) to generate a stitched projection image 1950, backprojecting each pixel of the stitched image into a 3D scene representing the surrounding environment using corresponding distances sampled from the adaptive 3D bowl 1930. As such, the stitched projection image 1950 may be mapped onto the adaptive 3D bowl 1930 (e.g., using UV mapping) to generate a textured 3D bowl 1960, and a view of the textured 3D bowl 1960 (e.g., the viewport rendering 1970) through an applicable, active, and/or designated viewport (e.g., corresponding to the viewport direction 1935) may be rendered. In some embodiments, the viewport rendering 1970 is presented on a monitor or other display device visible to occupants (e.g., the driver) of the ego-object.

FIG. 20 shows example surround view visualizations 2010 and 2020 with and without an adaptive 3D bowl, in accordance with some embodiments of the present disclosure. The surround view visualization 2010 represents a rendering that uses a fixed 3D bowl, while the surround view visualization 2020 represents a rendering that relies on an adaptive 3D bowl. Both surround view visualizations 2010 and 2020 represent the same 3D scene (also represented by 2D top-down occupancy grid 2030) in which a vehicle 2040 is parked between two adjacent vehicles 2050 and 2060. Note that the adjacent vehicles 2050 and 2060 are both distorted in the surround view visualization 2010 that used a fixed 3D bowl, and that the distortion of the adjacent vehicles 2050 and 2060 is reduced in the surround view visualization 2020 that used an adaptive 3D bowl.

Now referring to FIGS. 21 and 22, each block of methods 2100 and 2200, described herein, comprises a computing process that may be performed using any combination of hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory. The methods may also be embodied as computer-usable instructions stored on computer storage media. The methods may be provided by a standalone application, a service or hosted service (standalone or in combination with another hosted service), or a plug-in to another product, to name a few. In addition, the methods 2100 and 2200 may be understood, by way of example, with respect to the adaptive 3D bowl generator 150 of FIG. 1, the surround view visualization system 1300 of FIG. 13, or the adaptive 3D bowl generator 1600 of FIG. 16. However, these methods may additionally or alternatively be executed by any one system, or any combination of systems, including, but not limited to, those described herein.

FIG. 21 is a flow diagram showing a method 2100 for generating a surround view visualization based at least on an adaptive 3D bowl, in accordance with some embodiments of the present disclosure. The method 2100, at block B2102, includes determining a distance from an ego-object to a detected object in an environment. For example, with respect to FIG. 16, the 3D object detector 1620 may perform 3D object detection on sensor data captured by the ego-object (e.g., the input images 1610, corresponding RADAR or LiDAR data), and the radial distance mapper 1630 may compute distance(s) to the detected objects (e.g., the closest detected object in a direction corresponding to each angular increment). In another example, the adaptive 3D bowl generator 150 of FIG. 1 (e.g., the depth estimator 155 of the adaptive 3D bowl generator 150) may project sensor data captured by the ego-object, such as a LiDAR point cloud, onto a top-down 2D occupancy grid that represents locations of detected objects, and the radial distance mapper 1630 may compute distance(s) to the detected objects using the top-down 2D occupancy grid.

The method 2100, at block B2120, includes generating an adaptive three-dimensional (3D) bowl that models the environment with a shape based at least on the distance to the detected object. For example, with respect to FIG. 16, the 3D bowl parameter controller 1650 may use the distance to the detected object to size the ground plane (e.g., the inner bowl) of the adaptive 3D bowl to fit within the distance to the closest detected object. In some embodiments, the 3D bowl parameter controller 1650 may modify the shape, size (e.g., along one or more dimensions or axes), and/or orientation (e.g., rotation relative to an ego-vehicle) of the 3D bowl to place (e.g., one or more points that belong to) the closest detected object(s) between the ground plane of the bowl and the outer edge or rim of the bowl. Depending on the embodiment, the 3D bowl parameter controller 1650 may use distance to one or more detected objects to fit a (e.g., symmetrical) bowl shape, synthesize a shape using local deformations to accommodate different distances to different detected objects, apply one or more temporal filters, apply one or more spatial filters, and/or otherwise.

The method 2100, at block B2130, includes generating a surround view visualization of the environment based at least on the adaptive 3D bowl. For example, with respect to FIG. 13, the projection module 1335 may generate an orthographic projection of image data captured by the ego-object (e.g., aligned image data, stitched image data) using corresponding depth values from the adaptive 3D bowl to generate a (e.g., stitched) top-down projection image. In some embodiments, the UV mapping module 1345 may map image data captured by the ego-object (e.g., a top-down projection image) onto the adaptive 3D bowl to generate a textured 3D bowl (e.g., the textured 3D bowl 1340), and the viewport renderer 1355 may position and orient a virtual camera in a 3D scene with the textured 3D bowl 1340 and use the virtual camera to render a view (e.g., image 1350) of the textured 3D bowl 1340 for a corresponding viewport.

FIG. 22 is a flow diagram showing a method 2200 for generating a surround view visualization based at least on an adaptive 3D bowl with a shape that depends on distance to the closest detected object, in accordance with some embodiments of the present disclosure. The method 2200, at block B2202, includes generating a top-down two-dimensional (2D) occupancy grid based at least on projecting LiDAR data into a top-down view of an environment. For example, with respect to FIG. 1, the adaptive 3D bowl generator 150 (e.g., the depth estimator 155 of the adaptive 3D bowl generator 150) may project sensor data captured by an ego-object, such as a LiDAR point cloud, onto a top-down 2D occupancy grid that represents locations of detected objects.

The method 2200, at block B2220, includes determining distance to a closest detected object as a function of angle using the top-down 2D occupancy grid. For example, the adaptive 3D bowl generator 150 of FIG. 1 (e.g., the depth estimator 155 of the adaptive 3D bowl generator 150) or the radial distance mapper 1630 of FIG. 16 may compute distance(s) to the closest detected object as a function of angle (e.g., representing a rotation around an axis of the coordinate system of the ego-object, such as yaw) by computing the number of cells and/or the corresponding distance between the position of the ego-object and the closest occupied cell in a corresponding direction in the top-down 2D occupancy grid.

The method 2200, at block B2230, includes generating an adaptive three-dimensional (3D) bowl that models the environment with a shape that depends on the distance to the closest detected object and the angle. For example, with respect to FIG. 16, the 3D bowl parameter controller 1650 may use the distance to the detected object and the corresponding angle to size the ground plane (e.g., the inner bowl) of the adaptive 3D bowl to fit within the distance to the closest detected object in that direction. In some embodiments, the 3D bowl parameter controller 1650 may modify the shape, size (e.g., along one or more dimensions or axes), and/or orientation (e.g., rotation relative to an ego-vehicle) of the 3D bowl to place (e.g., one or more points that belong to) the closest detected object(s) between the ground plane of the bowl and the outer edge or rim of the bowl. Depending on the embodiment, the 3D bowl parameter controller 1650 may use distance to one or more detected objects to fit a (e.g., symmetrical) bowl shape, synthesize a shape using local deformations to accommodate different distances to different detected objects, apply one or more temporal filters, apply one or more spatial filters, and/or otherwise.

Under Vehicle Reconstruction

FIG. 23 is a diagram illustrating an example data flow through an under vehicle reconstruction system 2300, in accordance with some embodiments of the present disclosure. At a high level, the under vehicle reconstruction system 2300 uses cached sensor data from sensor(s) 2301 (e.g., image data from one or more cameras) of an ego-object (e.g., a vehicle) and ego-motion data 2340 of the ego-object to reconstruct the area under the ego-object (e.g., the example autonomous vehicle 3500 of FIGS. 35A-35D).

In the example illustrated in FIG. 23, an ego-object may be equipped with any number of sensor(s) 2301 (e.g., one or more cameras, such as fisheye cameras, which may correspond to the sensor(s) 101 of FIG. 1). For each set of frames of sensor data from the sensor(s) 2301 (e.g., each set of images) representing a particular time slice, an under vehicle reconstructor 2315 (e.g., which may correspond to the under vehicle reconstructor 130 of FIG. 1) may update a representation of the ground plane (e.g., composite map 2335) that caches one or more visualizations (e.g., a plurality of local maps, a composite map that merges a plurality of local maps) of a navigable area using the sensor data from the sensor(s) 2301, and may use ego-motion data 2340 of the ego-object to retrieve a portion of the representation of the ground plane (e.g., corresponding pixels of composite map 2335, corresponding pixels from a plurality of local maps) corresponding to the ego-object's current position to virtually reconstruct the area under the ego-object. In some embodiments, the under vehicle reconstructor 2315 selects and caches image data (e.g., the image data 105 of FIG. 1) from one of the sensor(s) 2301 corresponding to the direction of ego-motion of the ego-object. For example, if a vehicle is moving forward, image data from a front-facing (e.g., fisheye) camera may be used, and/or if a vehicle is moving backward, image data from a rear-facing (e.g., fisheye) camera may be used. In some embodiments, the stitching module 2305 (e.g., which may correspond to the stitching module 120 of FIG. 1) may stitch image data from multiple sensor(s) 2301 (e.g., multiple fisheye images) into a composite or stitched image, and the under vehicle reconstructor 2315 selects and caches a portion of the composite or stitched image corresponding to the direction of ego-motion. As such, the under vehicle reconstructor 2315 may use the ego-object's ego-motion data 2340 to retrieve a corresponding portion(s) of the cached local maps and/or the composite map 2335 (the under vehicle reconstruction or UVR).

In some embodiments, the stitching module 2305 may stitch the UVR withimage data from the sensor(s) 2301 to generate a composite stitchedimage, and the projection module 2350 (e.g., which may correspond withthe projection module 175 of FIG. 1 ) may project the composite stitchedimage onto a 2D or 3D representation of the environment surrounding theego-object (e.g., a top-down or ground 2D projection, a textured 3Dsurface such as a 3D bowl). In other embodiments, the projection module2350 may project image data from the sensor(s) 2301 and the UVR onto the2D or 3D representation of the environment, and the projected image datamay be stitched. As such, the view generator 2355 (e.g., which maycorrespond with the view generator 180 of FIG. 1 ) may generate arendered view 2360 of the projected image data from the perspective of avirtual camera (e.g., based on a pre-configured virtual camera positionand orientation, based on the state of the ego-object), and the renderedview 2360 (e.g., a 2D image) may be presented on a monitor visible to anoperator or occupant of the ego-object.

In FIG. 23, the under vehicle reconstructor 2315 includes a map generator 2320 (comprising a local map generator 2325 and a composite map updater 2330) and a local map retriever 2345. In some embodiments, the local map generator 2325 uses the ego-motion data 2340 to identify a direction of ego-motion and construct a local map of the environment, the ground, and/or the navigable space (e.g., the road), where the local map faces in the direction of ego-motion and may include a length that depends on the speed of ego-motion. The local map may be represented as a grid whose cells have known 3D coordinates in world space based on a specified length, width, and grid spacing. The local map generator 2325 may texturize the local map with image data (e.g., from the sensor(s) 2301 and/or the stitching module 2305) by assigning a color value to each cell in the grid from a corresponding pixel of the image data, as explained in more detail below.
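
As a hedged illustration (not the disclosed implementation), the rig-frame coordinates of such a grid might be generated as follows; the base length, look-ahead horizon, and 1 cm cell size are assumptions chosen only to make the example concrete:

```python
import numpy as np

def local_map_grid(speed_mps, heading="forward", width_m=3.0,
                   base_len_m=4.0, horizon_s=1.0, cell_m=0.01):
    """Build the (x, y, z=0) rig-frame coordinates of each cell of a
    rectangular local map extending ahead of (or behind) the vehicle.
    The map length grows with ego speed so faster motion caches more
    road surface per frame (an assumed heuristic)."""
    length_m = base_len_m + speed_mps * horizon_s
    xs = np.arange(0.0, length_m, cell_m)
    if heading == "backward":
        xs = -xs
    ys = np.arange(-width_m / 2.0, width_m / 2.0, cell_m)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    # Cells lie on the ground plane (z = 0) in the rig frame: shape (L, W, 3).
    return np.stack([gx, gy, np.zeros_like(gx)], axis=-1)
```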

In some embodiments, one or more deep neural networks (DNNs) (not depicted in FIG. 23) are used to segment the image data from the sensor(s) 2301 or the stitched image data from the stitching module 2305 to generate navigable space segmentation data 2310 distinguishing a navigable (e.g., drivable) space from moving (or moveable) objects, such as pedestrians, bicycles, and vehicles, represented in the image data. As such, the local map generator 2325 may omit color values for pixels that are part of moving (moveable) objects from the local map and/or may only include color values for pixels that are part of the navigable space, so the local map represents the navigable space (e.g., the road) without the moving (moveable) objects.

As such, the composite map updater 2330 may cache the local map generated for any particular time slice by merging the local map into a composite map 2335 that stores a composite representation of local maps generated for previous time slices. Any known merging, fusing, blending, stitching, or combining technique may be applied to combine data from overlapping regions. In some embodiments, the composite map updater 2330 may apply one or more image filters to fuse sensor data from different sources of the same modality, and/or from different modalities, where the fusing may be weighted based on a detected measure of image quality. In some embodiments, to save memory, the composite map updater 2330 may limit the size of the composite map 2335 (e.g., limited to representing local maps generated for a designated number of time slices or frames of image data, with a length that is limited based on speed of ego-motion). Although the embodiment illustrated in FIG. 23 contemplates merging the local map into the composite map 2335, some embodiments additionally or alternatively cache each local map in association with a representation of the ground plane, such that the representation of the ground plane (e.g., the composite map 2335) stores multiple unmerged color values for certain pixels in regions where overlapping local maps were cached. For example, each pixel of the composite map 2335 may store one or more color values (e.g., a merged color value, one or more unmerged color values) in different dimensions or layers, or may otherwise reference the one or more color values.

As such, the local map retriever 2345 may (e.g., use the ego-motion data2340 to) determine the position of the ego-object in world coordinates,and construct a local map representing the region underneath theego-object textured with image data retrieved from the composite map2335 (the UVR). For example, a grid may be defined under the ego-objectwith a specified length, width, and grid spacing, so the cells in thegrid have known 3D coordinates in the world space. As such, the localmap retriever 2345 may texturize the UVR with image data from thecomposite map 2335 by assigning a color value to each cell in the gridfrom a corresponding pixel of the image data (e.g., a merged pixelcached in the composite map 2335, a composite value such as an averagevalue generated from multiple unmerged color values cached fromdifferent local maps in a particular pixel in the composite map 2335),as explained in more detail below.

The resulting UVR may be visualized in various ways. In the embodimentillustrated in FIG. 23 , the stitching module 2305 stitches the UVR witha composite image representing a composite view of the surroundingenvironment to generate a stitched image, and the projection module 2350projects the stitched image onto a representation of the environment,such as a 3D surface (e.g., a 3D bowl) to generate a textured 3Dsurface. In some embodiments, a 3D model of the ego-vehicle may berendered over the textured 3D surface (e.g., as a transparent vehicle,with a transparent hood, with a transparent trailer). FIG. 24 is anexample surround view visualization 2400 with a transparent vehicle 2410rendered over an under vehicle reconstruction 2420 on a textured 3Dsurface 2430 (e.g., a 3D bowl), in accordance with some embodiments ofthe present disclosure.

In order to facilitate under vehicle (or under object) reconstruction, the environment surrounding the vehicle may be modeled as a geometric model in world space, and the geometric model may be used to align image data from the sensor(s) 2301, the ego-motion data 2340, and/or a 3D model of the vehicle. Typically, different sensors have their own 3D coordinate systems. As such, some embodiments align coordinate systems for the sensor(s) 2301 and the vehicle (e.g., a vehicle rig) with respect to the world space. FIG. 25 shows example vehicle rig and camera coordinate systems, in accordance with some embodiments of the present disclosure. In this example, a vehicle rig coordinate system 2510 is defined with an origin O_(RIG) located at the ground projection of the center of the rear vehicle axle 2520. The vehicle rig coordinate system 2510 includes an X_(RIG) axis that points to the front of the vehicle, a Y_(RIG) axis that points to the left, and a Z_(RIG) axis that points upward. Vehicle ego-motion may be defined as the pose of the vehicle rig coordinate system 2510 with respect to a world space. Camera extrinsic parameters (e.g., location and orientation) may be calibrated relative to the vehicle rig coordinate system 2510 to align each camera's coordinate system (e.g., front camera coordinate system 2530) in the world space.

In some embodiments, the vehicle's surrounding environment is modeled using a 3D surface, such as a 3D bowl. FIG. 26 illustrates a cross-section view of an example 3D bowl, in accordance with some embodiments of the present disclosure. The 3D bowl includes an inner bowl 2610, which may be a circle or ellipse positioned on the ground plane in world space, and an outer bowl 2620 that rises up from the ground plane. In an example embodiment, the inner bowl 2610 is a circle with a radius R_(INNER) and a height value in world space (e.g., Z_(RIG) value in the vehicle rig coordinate system) of zero. The outer bowl 2620 may be a monotonically increasing curve defined by a function g(r), where r is the distance to the bowl center. In an example embodiment, the height value of a point P(x_(p), y_(p)) on the outer bowl 2620 in the vehicle rig coordinate system is given by:

$g(r) = c_1 \cdot r^{c_2}$  (Eq. 1)

where $r = \sqrt{(x_p - x_o)^2 + (y_p - y_o)^2}$, $(x_o, y_o)$ are the coordinates of the center of the 3D bowl in the vehicle rig coordinate system, $c_1$ is a constant that controls the scale of the 3D bowl's height, and $c_2$ is a parameter that controls the shape of the curve (e.g., $c_2 = 2$ for a quadratic curve).
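
As a small illustration of Eq. 1 (with hypothetical values for R_(INNER), $c_1$, and $c_2$), the bowl height at an arbitrary ground location might be evaluated as follows; some implementations may instead shift the argument of g so the outer bowl meets the ground plane continuously at R_(INNER), which is a design choice rather than something stated here:

```python
import math

def bowl_height(x_p, y_p, x_o=0.0, y_o=0.0, r_inner=5.0, c1=0.002, c2=2.0):
    """Rig-frame height of a bowl point: zero on the inner ground plane,
    g(r) = c1 * r**c2 on the outer bowl (Eq. 1), where r is the distance
    to the bowl center. r_inner, c1, and c2 are illustrative values."""
    r = math.hypot(x_p - x_o, y_p - y_o)
    return 0.0 if r <= r_inner else c1 * r ** c2
```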

FIG. 26 also includes a virtual camera coordinate system 2650 thatrepresents a position and orientation of a virtual camera in worldspace, such that a 2D view (e.g., an image) of the 3D bowl may berendered from the perspective of the virtual camera. In someembodiments, an image generated using a front camera (with a positionand orientation in world space corresponding to a camera coordinatesystem 2660) may be projected and/or mapped onto the 3D bowl,effectively texturing the 3D bowl with image data, such that anarbitrary point P4 on the 3D bowl may be assigned a pixel valuedetermined by the front camera using physical world point P₂'sinformation. As a result, rendering a view of the textured 3D bowleffectively images much more of the surrounding environment within thefield of view of the virtual camera than without using a 3D bowl model.

FIG. 26 also illustrates an under vehicle area 2630 which is outside of the field of view of the front camera. To facilitate reconstructing the under vehicle area 2630, in some situations (e.g., when the vehicle is traveling forward), a local map 2640 extending in front of the vehicle may be generated (e.g., by the local map generator 2325 of FIG. 23). In some embodiments, the local map 2640 is a textured 2D planar surface, and may be represented using a grid. The grid may have a specified length, width, and grid spacing that define the resolution of the local map 2640. By way of non-limiting example, each cell of the grid may be a square with some fixed or variable length (e.g., 1 cm×1 cm). Although some embodiments involve a rectangular local map 2640, other shapes are possible.

For any time slice, the cells in the grid may be assigned color values from an image captured in that time slice by a camera pointing in the direction of ego-motion. For example, if the vehicle is moving forward, an image captured using a front camera on the vehicle may be used to populate color values in the local map 2640. In some cases, the grid spacing of the local map 2640 represents a distance, while the unit of the image may be a pixel. Since each pixel in the image defines a ray, the ray may be back-projected onto the ground plane to a corresponding cell of the local map 2640. As such, a color value for a particular cell in the grid of the local map 2640 may be identified by retrieving the color value of the corresponding pixel of the image. In some embodiments, the process is repeated to populate each cell in the grid with a corresponding pixel value. In other embodiments, color values for pixels that do not belong to a navigable space (e.g., a road) (e.g., as denoted by navigable space segmentation data 2310) are omitted from the local map 2640. Taking a rectangular local map 2640 in front of the vehicle as an example, the distance between the furthermost map edge (the front edge) and the front of the vehicle may be denoted as d_(map) (as illustrated in FIG. 26), and the width of the map may be denoted as w_(map). In this example, the process of creating the local map 2640 may retrieve a value (or determine not to) for each cell within this d_(map)×w_(map) map area.

In some embodiments that involve fisheye cameras, the distortion introduced by the camera's fisheye lens may be represented using the F-theta model, and the lens's distortion coefficients may be given by k₀, k₁, . . . , k_(N) (e.g., N=4). For any cell in the grid of the local map 2640, the cell may be represented by a point P_(i)(r_(x), r_(y), r_(z)), where r_(x), r_(y), and r_(z) are the point's coordinates in the vehicle rig coordinate system. Taking the front camera centered at the origin O_(CAM_f) of the front camera coordinate system 2660 and having an optical axis Z_(CAM_f) pointing forward and parallel to the ground plane, the angle θ between (i) a ray vector P_(i)O_(CAM_f) pointing from the camera center to the point P_(i) (e.g., ray 2670 for point P₂) and (ii) the front camera optical axis Z_(CAM_f) may be given by:

$\theta = \cos^{-1}\!\left( r_z \,/\, \sqrt{r_x^2 + r_y^2 + r_z^2} \right)$  (Eq. 2)

Within a fisheye image generated by the front camera, the distance f(θ) between an imaged point p_(i) (in the fisheye image) that images point P_(i) and the fisheye principal point (u₀, v₀) (given as part of the intrinsic parameters) may be given by:

$f(\theta) = k_0 + k_1\theta + k_2\theta^2 + \dots + k_N\theta^N$  (Eq. 3)

As such, in this example, the color value for a particular cell in the grid of the local map 2640 may be retrieved from the pixel in the fisheye image whose coordinates are given by:

$\begin{cases} x = u_0 + f(\theta)\, r_x / \sqrt{r_x^2 + r_y^2} \\ y = v_0 + f(\theta)\, r_y / \sqrt{r_x^2 + r_y^2} \end{cases}$  (Eqs. 4 and 5)

In some embodiments, the grid of the local map 2640 may be populated using a forward approach (e.g., for each 3D point in the 2D grid, identify its corresponding imaged 2D point) or a backward approach (e.g., for each pixel/imaged 2D point in an image, identify its corresponding 3D point in the 2D grid). Some embodiments may favor the forward approach since it may identify a corresponding pixel for every cell in the grid without missing any cells.
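
The forward approach of Eqs. 2-5 can be sketched as follows. The sketch assumes the grid cell is first expressed in the front camera's own coordinate system (using the calibrated extrinsics), since Eqs. 2-5 are written relative to the camera center and optical axis; the function name and argument layout are illustrative rather than the disclosed implementation:

```python
import numpy as np

def rig_point_to_fisheye_pixel(p_rig, cam_to_rig_R, cam_to_rig_t, k, u0, v0):
    """Forward approach: map one local-map cell (a 3D point in the rig
    frame) to pixel coordinates in a fisheye image via the F-theta model
    (Eqs. 2-5). k is the list of distortion coefficients k0..kN and
    (u0, v0) is the principal point; the extrinsics are the calibrated
    camera-to-rig rotation and translation."""
    # Express the point in the camera frame (invert the extrinsics).
    p_cam = cam_to_rig_R.T @ (np.asarray(p_rig, dtype=float) - cam_to_rig_t)
    rx, ry, rz = p_cam

    # Eq. 2: angle between the ray and the optical axis.
    theta = np.arccos(rz / np.sqrt(rx * rx + ry * ry + rz * rz))

    # Eq. 3: F-theta polynomial gives the radial distance from the principal point.
    f_theta = sum(ki * theta ** i for i, ki in enumerate(k))

    # Eqs. 4 and 5: split that radial distance along the image x/y directions.
    rxy = np.sqrt(rx * rx + ry * ry)
    if rxy == 0.0:  # a ray along the optical axis maps to the principal point
        return u0, v0
    return u0 + f_theta * rx / rxy, v0 + f_theta * ry / rxy
```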

Continuing with the example using an image generated with the front camera, since the field of view of the front camera typically does not include the under vehicle area 2630, the local map 2640 generated for a particular time slice will not typically represent the under vehicle area 2630 at that time slice. As such, the local map 2640 generated for a particular time slice may be cached by merging it into a composite map (e.g., by the composite map updater 2330 of FIG. 23) for future use. In order to reconstruct the under vehicle area 2630 for the current time slice, a previously constructed corresponding portion of the composite map may be retrieved (e.g., by the local map retriever 2345 of FIG. 23). As such, once multiple local maps have been generated and merged together into the composite map, and the vehicle travels over a region of the composite map that is populated with a local map from a previous time slice, the under vehicle area may be retrieved.

More specifically, the local map constructed for each time slice t_(i), i=0, 1, . . . , T, may be denoted as M_(t_i), and the updated composite map that merges (e.g., all, n) previous local maps may be denoted as M_(t_i)′. At time t_(i+1), a map updating process (e.g., executed by the composite map updater 2330 of FIG. 23) may create M_(t_i+1)′ by combining the new local map M_(t_i+1) with M_(t_i)′, for example, as:

$M'_{t_{i+1}} = M'_{t_i} \cup M_{t_{i+1}}$  (Eq. 6)

where ∪ is a map union operator that merges two aligned maps together (e.g., using blending).
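
One hedged reading of this union operator, for maps already resampled onto a shared grid with per-cell validity masks, is a simple masked alpha blend; the blend weight and the mask bookkeeping below are assumptions for illustration only:

```python
import numpy as np

def update_composite_map(composite_rgb, composite_valid,
                         local_rgb, local_valid, blend=0.5):
    """Merge an aligned local map into the composite map (Eq. 6).
    Both maps are float RGB arrays of shape (H, W, 3) on the same grid;
    'valid' masks (H, W) mark cells that hold a cached color.
    Overlapping cells are alpha-blended; newly observed cells are copied."""
    overlap = composite_valid & local_valid
    new_only = local_valid & ~composite_valid

    out_rgb = composite_rgb.copy()
    out_rgb[overlap] = ((1.0 - blend) * composite_rgb[overlap]
                        + blend * local_rgb[overlap])
    out_rgb[new_only] = local_rgb[new_only]
    return out_rgb, composite_valid | local_valid
```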

FIG. 27 illustrates an example technique for updating a composite map that visualizes a drivable area, in accordance with some embodiments of the present disclosure. Assume a vehicle 2710 is moving along a path 2720, and both vehicle position and orientation change as the vehicle 2710 moves along the path 2720. O_(t_i) represents the origin of the vehicle rig coordinate system of the vehicle 2710 at the time t_(i). Local map M_(t_i) is the rectangular area in front of the vehicle 2710 at time t_(i) (e.g., local map 2730 is M_(t_i) at time t₁). As the vehicle 2710 moves forward, each newly perceived local map M_(t_i) may be continuously digested into a composite map formed by the union of the previously created local maps. When the vehicle 2710 moves on top of a populated area of the composite map, the area under the vehicle 2710 becomes available for retrieval.

Vehicle ego-motion (e.g., the vehicle ego-motion data 2340 of FIG. 23) may provide the relative pose relationship between the vehicle rig origins O_(t_i) and O_(t_i+1) at times t_(i) and t_(i+1), respectively, and the relative pose relationship may be used to geometrically align a new local map M_(t_i+1) and the previously created composite map M_(t_i)′. More specifically, the translation vectors of the vehicle rig coordinate system at times t_(i) and t_(i+1) may be denoted as T_(t_i) and T_(t_i+1), and the respective rotation matrices may be denoted as R_(t_i) and R_(t_i+1). For each point in the world coordinate system P_(w), its corresponding points P_(t_i) and P_(t_i+1) in the vehicle rig coordinate system at times t_(i) and t_(i+1) may be represented as:

$\begin{cases} P_w = R_{t_i} P_{t_i} + T_{t_i} \\ P_w = R_{t_{i+1}} P_{t_{i+1}} + T_{t_{i+1}} \end{cases}$  (Eqs. 7 and 8)

By equating Equations 7 and 8 and solving for P_(t_i+1), the value of P_(t_i+1) at time t_(i+1) may be retrieved as:

$P_{t_{i+1}} = R_{t_{i+1}}^{-1}\left( R_{t_i} P_{t_i} + T_{t_i} - T_{t_{i+1}} \right)$  (Eq. 9)

where T_(t_i), T_(t_i+1), R_(t_i), and R_(t_i+1) may be included in, or determined based on, vehicle ego-motion.

Equation 9 establishes a geometric transformation relationship between a new local map M_(t_i+1) and a previously constructed composite map M_(t_i)′, which provides a correspondence between any point in M_(t_i+1) and its corresponding point in M_(t_i)′, and vice versa. As such, once enough of the composite map has been constructed, at any time t_(i+1), the location of the vehicle 2710 in world space may be determined (e.g., using vehicle ego-motion), a desired area under the vehicle may be divided into a grid of cells, and the coordinates of each cell in the world space may be identified (e.g., based on the relationship between the vehicle rig coordinate system, the size of the vehicle, and/or the ground). Each cell in the under vehicle area may be represented as a point in world space that moves from P_(t_i) to P_(t_i+1) from time t_(i) to t_(i+1). As such, a pixel value for a point P_(t_i+1) in the under vehicle area at time t_(i+1) may be retrieved from a corresponding pixel in the composite map identified using the geometric transformation defined by Equation 9. As such, color values for the cells in the grid in the under vehicle area may be retrieved from the composite map to reconstruct the under vehicle area.
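
A hedged sketch of this retrieval step follows. It assumes the composite map is cached on a grid anchored to the rig frame at time t_(i) and indexed by an illustrative origin and cell size; Eq. 9 (inverted to go from the t_(i+1) frame back to the t_(i) frame) supplies the geometric correspondence, while the names, arguments, and bounds checks are assumptions:

```python
import numpy as np

def rig_point_at_next_time(p_ti, R_ti, T_ti, R_ti1, T_ti1):
    """Eq. 9: express a point known in the rig frame at time t_i in the
    rig frame at time t_(i+1), given the rig poses (R, T) with respect
    to the world at the two times."""
    return R_ti1.T @ (R_ti @ p_ti + T_ti - T_ti1)

def reconstruct_under_vehicle(cells_ti1, R_ti, T_ti, R_ti1, T_ti1,
                              composite_rgb, composite_valid,
                              cell_m, origin_xy):
    """For each grid cell under the vehicle at time t_(i+1) (rig-frame
    coordinates, shape (H, W, 3)), find its location in the earlier
    composite map and copy the cached color. origin_xy and cell_m define
    an assumed mapping from t_i rig coordinates to composite-map indices."""
    uvr = np.zeros(cells_ti1.shape[:2] + (3,), dtype=composite_rgb.dtype)
    for i in range(cells_ti1.shape[0]):
        for j in range(cells_ti1.shape[1]):
            # Invert Eq. 9 to map the current rig frame back to t_i.
            p_ti = R_ti.T @ (R_ti1 @ cells_ti1[i, j] + T_ti1 - T_ti)
            u = int((p_ti[0] - origin_xy[0]) / cell_m)
            v = int((p_ti[1] - origin_xy[1]) / cell_m)
            if (0 <= u < composite_rgb.shape[0]
                    and 0 <= v < composite_rgb.shape[1]
                    and composite_valid[u, v]):
                uvr[i, j] = composite_rgb[u, v]
    return uvr
```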

In some embodiments, this UVR may be used to fill a missing undervehicle area in a textured 3D surface (e.g., a textured 3D bowl) usingthe updated composite map. In some cases, a UVR may be stitched into acomposite image (e.g., a panorama, 360° image, composite groundprojection), and the composite image with the UVR may be projected ontothe 3D surface (e.g., the 3D bowl). As such, a 3D bowl may be fullytextured, and a view of the textured 3D bowl may be rendered with aspecified viewport.

In some embodiments, at the very beginning of a trip, nothing may be cached in the composite map until the vehicle starts to move and travels over an area that was previously perceived (e.g., by camera frames). In some embodiments, a cached composite map from a previous trip may be used to initialize a composite map for initial UVR(s) for a new trip.

FIG. 28 shows an example under vehicle reconstruction 2830 using simulated fisheye images 2810a-d, in accordance with some embodiments of the present disclosure. In this example, an ego-vehicle and its four fisheye cameras are simulated with a virtual front fisheye camera installed in the center of the front bumper area, left and right virtual fisheye cameras installed inside the mirror frames, and a virtual rear fisheye camera installed at the back of the ego-vehicle above the license plate. This example uses global ego-motion to provide the ego-vehicle's pose with respect to a pre-defined world space (e.g., the global coordinate system for the simulated 3D environment). For each set of synchronized simulated fisheye images 2810a-d from the four virtual fisheye cameras, a corresponding simulated ego-motion entry specifies the ego-vehicle rig's position and orientation within the global coordinate system. The image 2820 represents a top-down (e.g., bird's-eye) view of a 3D bowl model textured with the four simulated fisheye images 2810a-d and an example under vehicle reconstruction 2830. In this example, the circle 2840 represents the inner circle of the 3D bowl model, and the under vehicle reconstruction 2830 fills in the area under the vehicle. Note that the white arrow in the center of the image 2820 includes two parts, the top of which is from the front simulated fisheye image 2810a, and the bottom of which is from the under vehicle reconstruction 2830. As such, the image 2820 aligns these two parts together and provides a complete visualization of the arrow shape.

FIG. 29 shows an example under vehicle reconstruction 2930 using real fisheye images 2910a-d, in accordance with some embodiments of the present disclosure. In this example, four fisheye cameras are installed at the four sides of a real car and are calibrated against the vehicle rig coordinate system, and relative ego-motion provides a relative vehicle rig transformation matrix (e.g., representing rotation and translation matrices) for nearby time slices (e.g., timestamps) when fisheye images are captured. In this example, the car is driven on a road with the landmark “STOP” letters printed on its surface. Image 2920 represents a rendered top-down view of a 3D bowl model textured with the four fisheye images 2910a-d and an example under vehicle reconstruction 2930. In this example, the circle 2940 represents the inner circle of the 3D bowl model, and the under vehicle reconstruction 2930 fills in the area under the vehicle. Note that the “STOP” letters are split into two parts. The left bottom part is under the vehicle and invisible to all four fisheye cameras, while the rest of the letters are within the field of view of the front and right fisheye cameras. In the final reconstructed result in the center of the image 2920, these parts are aligned, making the letters clearly recognizable from a bird's eye view.

As such, a UVR may be used to fill in a missing blind spot in a surround view visualization. In some embodiments, a 3D model of the vehicle may be rendered on top of a UVR with a (configurable) degree of transparency, allowing an operator to see through a 3D model of the vehicle from the perspective of a virtual camera positioned outside (or even inside) the 3D model, which may be beneficial in various use cases, such as assisting with lane changes on a freeway.

Now referring to FIGS. 30 and 31 , each block of methods 3000 and 3100,described herein, comprises a computing process that may be performedusing any combination of hardware, firmware, and/or software. Forinstance, various functions may be carried out by a processor executinginstructions stored in memory. The methods may also be embodied ascomputer-usable instructions stored on computer storage media. Themethods may be provided by a standalone application, a service or hostedservice (standalone or in combination with another hosted service), or aplug-in to another product, to name a few. In addition, methods 3000and/or 3100 may be understood, by way of example, with respect to thesystem 100 of FIG. 1 or the under vehicle reconstruction system 2300 ofFIG. 23 . However, these methods may additionally or alternatively beexecuted by any one system, or any combination of systems, including,but not limited to, those described herein.

FIG. 30 is a flow diagram showing a method for virtually reconstructingan area of a ground plane under an ego-object, in accordance with someembodiments of the present disclosure. The method 3000, at block B3002,includes generating, using image data generated using one or morecameras of an ego-object during a first time slice, a local maprepresenting an observed portion of a ground plane of an environmentsurrounding the ego-object. For example, with respect to FIG. 23 , thelocal map generator 2325 may use the ego-motion data 2340 to identify adirection of ego-motion of the ego-object and construct a local map ofthe ground plane and/or the navigable space (e.g., the road). The localmap may face in the direction of ego-motion and may include a lengththat depends on the speed of ego-motion. The local map generator 2325may texturize the local map with the image data generated using one ormore cameras by assigning a color value to each cell in the grid from acorresponding pixel of the image data.

The method 3000, at block B3004, includes updating a representation ofthe ground plane based at least on the local map. For example, withrespect to FIG. 23 , the composite map updater 2330 may merge the localmap into a composite representation of local maps generated for previoustime slices. Any known merging, fusing, blending, stitching, orcombining technique may be applied to combine data from overlappingregions. In another example, the composite map updater 2330 mayseparately cache each of a plurality of local maps.

The method 3000, at block B3006, includes virtually reconstructing anarea of the ground plane under the ego-object during the first timeslice based at least on retrieving a corresponding portion of therepresentation of the ground plane. For example, with respect to FIG. 23, the local map retriever 2345 may determine the position of theego-object in world coordinates and construct a local map representingthe region underneath the ego-object textured with image data retrievedfrom a composite map. For example, for any given pixel, the local mapretriever 2345 may retrieve a merged value from a compositerepresentation of the ground plane and/or a plurality of separatelycached values from multiple cached local maps. In the latter instance,the local map retriever 2345 may merge (e.g., average) the plurality ofseparately cached values to generate a merged value for the given pixel.

FIG. 31 is a flow diagram showing a method for merging a local map intoa representation of the ground plane for virtually reconstructing anarea of the ground plane under an ego-object, in accordance with someembodiments of the present disclosure. The method 3100, at block B3102,includes merging, into a representation of a ground plane representingportions of the ground plane observed by one or more sensors of anego-object, a local map representing a portion of the ground planeobserved by the one or more sensors during a first time slice. Forexample, with respect to FIG. 23 , the local map generator 2325 may usethe ego-motion data 2340 to identify a direction of ego-motion of theego-object and construct a local map of the ground plane and/or thenavigable space (e.g., the road) facing in the direction of ego-motion,and the composite map updater 2330 may merge the local map into acomposite representation of local maps generated for previous timeslices using any known merging, fusing, blending, stitching, orcombining technique to combine data from overlapping regions.

The method 3100, at block B3104, includes virtually reconstructing anarea of the ground plane under the ego-object during the first timeslice based at least on retrieving a corresponding portion of therepresentation of the ground plane. For example, with respect to FIG.23, the local map retriever 2345 may determine the position of theego-object in world coordinates and construct a local map representingthe region underneath the ego-object textured with image data retrievedfrom the composite map.

Optimized Visualization Streaming

One or more embodiments are directed to techniques for streaming arepresentation of an environment in and/or around an ego-object (e.g., avehicle, such as the example autonomous vehicle 3500 of FIGS. 35A-35D).For example, systems and methods are disclosed that stream, render,and/or otherwise deliver a representation of various types of sensordata to a remote location to facilitate various remote experiences. Thepresent techniques may be utilized to visualize and/or stream arepresentation of an environment in and around an ego-object, such as avehicle, robot, and/or other type of object, in systems such as parkingvisualization systems, Surround View Systems, and/or others.

FIG. 32 is an example representation streaming system 3200, in accordance with some embodiments of the present disclosure. At a high level, the representation streaming system 3200 includes an object 3201, which may be an ego-object such as a vehicle, robot, or person, or a stationary object such as a sign, pole, wall, or bridge. The object 3201 may be stationary or traverse through an environment, and a representation streaming engine 3205 associated with the object 3201 may use one or more sensors (e.g., cameras, microphones, ultrasonic sensors, RADAR sensors, LiDAR sensors, infrared sensors) to capture sensor data representing the environment and stream a representation of the sensor data to a remote device 3270 via a network 3260 that includes at least one wireless communication channel. Depending on the embodiment, the streamed representation of the environment may take a variety of forms and may facilitate various remote experiences at the remote device 3270, such as streaming to a remote viewer (e.g., a friend or relative) operating or wearing the remote device 3270, streaming to a remote or fleet operator using the remote device 3270, streaming to a mobile app on the remote device 3270 configured to self-park or summon the object 3201 (e.g., in embodiments in which the object 3201 is an autonomous vehicle, such as the autonomous vehicle 3500 of FIGS. 35A-35D), rendering a 3D augmented reality (AR) or virtual reality (VR) representation of the physical environment on the remote device 3270, and/or others. In some embodiments, the representation streaming engine 3205 may operate a stream that includes one or more command channels used to control data collection, rendering, stream content, and/or object maneuvers, such as during an emergency, self-park, or summon scenario.

Generally, the object 3201 may be, include, or be associated with a computing device that includes the representation streaming engine 3205. In some embodiments, the representation streaming engine 3205 (or a portion thereof) may correspond to the representation streaming engine 195 of FIG. 1 (or a portion thereof). Depending on the embodiment, the computing device with the representation streaming engine 3205 and/or the remote device 3270 may take various forms, such as a vehicle computer system, a monitoring system, an embedded system controller, a camera, a smartphone, a wearable computer (e.g., a smart watch, an AR/VR headset), a handheld communications device, a consumer electronic device, a workstation, a desktop computer, a laptop computer, a tablet computer, a server, and/or any other suitable computing device.

In the implementation illustrated in FIG. 32 , the representationstreaming engine 3205 communicates with one or more computing devices,such as the remote device 3270 and/or a server (not shown in FIG. 32 ),via the network 3260. The network 3260 may include one or more localarea networks (LANs), wide area networks (WANs), wireless communicationchannels, and/or other communication networks, as will be understood bythose of ordinary skill in the art.

Depending on the embodiment, various allocations of functionality are implemented across any number and/or type(s) of devices. In the example illustrated in FIG. 32, the representation streaming engine 3205 and the remote device 3270 are illustrated with various components. In other examples, one or more of the components of the representation streaming engine 3205 and/or the remote device 3270 (or some portion thereof) are executed elsewhere (e.g., as part of the representation streaming engine 3205, the remote device 3270, or a server between the representation streaming engine 3205 and the remote device 3270). In another example, one or more components of the representation streaming engine 3205 and/or the remote device 3270 (or some portion thereof) are distributed across some other number and/or type(s) of devices (e.g., hosted on a server, such as one operating within a communication channel between the representation streaming engine 3205 and the remote device 3270) that coordinate via the network 3260 to execute the functionality described herein. These are just examples, and any suitable allocation of functionality among these or other devices is possible within the scope of the present disclosure.

Although not depicted in FIG. 32, the representation streaming engine 3205 may include or receive sensor data generated using any number and type of sensor, such as global navigation satellite systems (“GNSS”) sensor(s) 3558 (e.g., Global Positioning System sensor(s)), RADAR sensor(s) 3560, ultrasonic sensor(s) 3562, LiDAR sensor(s) 3564, inertial measurement unit (IMU) sensor(s) 3566 (e.g., accelerometer(s), gyroscope(s), magnetic compass(es), magnetometer(s), etc.), microphone(s) 3596, stereo camera(s) 3568, wide-view camera(s) 3570 (e.g., fisheye cameras), infrared camera(s) 3572, surround camera(s) 3574 (e.g., 360° cameras), long-range and/or mid-range camera(s) 3598, speed sensor(s) 3544 (e.g., for measuring the speed of the vehicle 3500), vibration sensor(s) 3542, steering sensor(s) 3540, brake sensor(s) (e.g., as part of the brake sensor system 3546), and/or other sensor types.

To begin with a high-level overview, in some embodiments, the representation streaming engine 3205 includes a surround view system 3210, an audio system 3220, a salient event detector 3230, a stream transport system 3240, and/or a remote controlled maneuvering component 3250. The surround view system 3210 (e.g., which may correspond to the surround view system 100 of FIG. 1) may use one or more sensors (e.g., cameras) to capture sensor data and/or generate a corresponding representation of the environment (e.g., a projection image, a stitched image, an adaptive 3D bowl, a textured 3D bowl, a rendered view of one of the foregoing). The audio system 3220 may use one or more microphones to capture audio data and/or generate a corresponding representation of the environment (e.g., a representation of directional or surround audio). The salient event detector 3230 may detect instances of supported salient events (e.g., particular classes of audio or visual events) and corresponding directions using the sensor data (e.g., using one or more machine learning models). When the salient event detector 3230 detects an instance of a salient event, the surround view system 3210 may use the determined direction of the salient event to render a visualization of the environment through a viewport looking in that direction, and the audio system 3220 may use the determined direction of the salient event to emphasize audio from that direction. The stream transport system 3240 may use any streaming technology to transport one or more representations of the environment to the remote device 3270. The remote controlled maneuvering component 3250 may include remote control functionality (e.g., NVRC) that allows a remote operator using the remote device 3270 to issue commands that maneuver the object 3201.

Continuing with a high-level overview, the remote device 3270 may include a remote experience application 3280, which may provide a variety of functionality. For example, depending on the embodiment, the remote experience application 3280 may provide one-way or two-way video and/or audio streams between the object 3201 and the remote experience application 3280 (e.g., video and/or audio feed(s) from outside or inside a cockpit or vehicle cabin); rendering a 3D AR or VR representation of the environment in and/or around the object 3201 using the streamed representation of the environment; providing one or more command channels to the representation streaming engine 3205 to control data collection, rendering, streamed content, and/or maneuvers of the object 3201 (e.g., during an emergency, self-park, or summon scenario); and/or other functionality.

In some embodiments, the remote experience application 3280 includes astream presentation component 3282, a viewport controller 3284, an audiocomponent 3286, a self-parking component 3292, and/or a summoningcomponent 3294. The stream presentation component 3282 may cause theremote device 3270 to display a representation of the environment basedon the streamed data received from the representation streaming engine3205 (e.g., raw sensor data, video feed(s), audio feed(s), an AR or VRvisualization of the environment). The viewport controller 3284 maycontrol a user interface that accepts a command from a remote operatorusing the remote device 3270 designating a viewport for rendering by thesurround view system 3210. The audio component 3286 may generatedirectional or surround audio, control audio playback, and/or steerdirectional or surround view audio generated by the audio system 3220 ofthe representation streaming engine 3205. The self-parking component3292 may control a user interface that accepts a command from a remoteoperator using the remote device 3270 instructing the remote controlledmaneuvering component 3250 to self-park the object 3201 (e.g., anautonomous or semi-autonomous vehicle). The summoning component 3294 maycontrol a user interface that accepts a command from a remote operatorusing the remote device 3270 instructing the remote controlledmaneuvering component 3250 to summon the object 3201 (e.g., anautonomous or semi-autonomous vehicle).

Taking an example embodiment in which the object 3201 is a vehicle with a cabin, the object 3201 may be equipped with external sensors capturing a representation of the outside environment and/or internal sensors capturing a representation of the environment inside the cabin. In some cases, the stream transport system 3240 may stream raw sensor data (e.g., video feed(s), LiDAR data, RADAR data, audio narration). Additionally or alternatively, the stream transport system 3240 may stream some rendering or other representation of the raw sensor data. For example, the surround view system 3210 may project sensor data such as image or LiDAR data onto a 3D representation (e.g., a 3D bowl) and use a virtual camera to render a view of the 3D representation in a 2D viewport. In some embodiments, the stream transport system 3240 may transport the 3D representation, the rendered 2D viewport, and/or some segment or portion thereof. The surround view system 3210 may set the position and orientation of the virtual camera in various ways, such as using a position and orientation (or corresponding scenario) specified by a remote command received via the remote controlled maneuvering component 3250 (e.g., enabling a remote person to control what perspective is rendered and/or streamed); using a position and orientation corresponding to occupant gaze, head pose, or body pose (e.g., detected using a Driver Monitoring System (DMS) and/or an Occupant Monitoring System (OMS)); using a position and orientation corresponding to a driving scenario (e.g., parking, direction and/or speed of ego-motion); and/or otherwise.

Taking an example embodiment involving directional or surround audio, the object 3201 (e.g., a vehicle) may be affixed with multiple microphones around the object 3201. In some embodiments, the audio system 3220 may capture raw audio data. The audio system 3220 of the representation streaming engine 3205 may remove environmental noise (e.g., road noise) from audio data in various ways (e.g., using one or more machine learning models to subtract noise from an audio signal). In some embodiments, the stream transport system 3240 may stream the raw audio data to some remote location (e.g., the audio component 3286 of the remote experience application 3280, a server) where directional or surround audio may be computed. In some cases, the audio system 3220 computes directional or surround audio, and the stream transport system 3240 streams a representation of the directional or surround audio.

In some embodiments, the audio system 3220 may set the position and orientation of a soundport for the directional or surround audio in various ways, such as using a position and orientation (or corresponding scenario) specified by a remote command (received via the audio component 3286 of the remote experience application 3280) steering the directional or surround audio (e.g., enabling a remote person to control what perspective is rendered and/or streamed). Additionally or alternatively, the salient event detector 3230 may analyze sensor data to detect instances of supported salient events (e.g., whether implemented in the representation streaming engine 3205, the remote experience application 3280, or elsewhere) and determine a corresponding direction, and the determined direction may be used to emphasize audio from that direction. For example, the salient event detector 3230 may include one or more machine learning models (e.g., any known audio classifier) that analyze a captured audio signal to detect one or more supported audio classes representing a salient audio event (e.g., an emergency noise or other salient audio event such as a vehicle horn, siren, screeching tires, a collision, and/or others). Upon detecting a supported salient audio event from a microphone pointing in a particular direction, the audio system 3220 may steer the directional audio in that direction.
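
As a hedged illustration of this steering step (not the disclosed implementation), per-microphone gains might be weighted by how closely each microphone's yaw points toward the detected event direction; the Gaussian falloff and gain constants below are assumptions:

```python
import numpy as np

def steer_microphone_gains(mic_yaws_deg, event_yaw_deg,
                           base_gain=0.2, boost=1.0, sigma_deg=45.0):
    """Weight each microphone channel by its angular proximity to the
    detected salient event so audio from that direction is emphasized
    while the rest of the surround mix stays audible at a lower level."""
    mic_yaws = np.asarray(mic_yaws_deg, dtype=float)
    # Smallest signed angular difference between each mic and the event.
    diff = (mic_yaws - event_yaw_deg + 180.0) % 360.0 - 180.0
    gains = base_gain + boost * np.exp(-(diff ** 2) / (2.0 * sigma_deg ** 2))
    return gains / gains.max()
```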

As such, the remote experience application 3280 (e.g., the audiocomponent 3286) may receive, compute, and/or playback directional orsurround audio, for example, to facilitate an immersive experience for aremote viewer or operator.

In some embodiments, the salient event detector 3230 may use othersensor data to detect an emergency or other salient event, and therepresentation streaming engine 3205 may use the detected salient eventto steer capturing or rendering of sensor data towards the direction ofthe detected salient event. For example, the salient event detector 3230may include one or more machine learning models that detect salientevents by analyzing a representation of image data, video data,proximity data, LiDAR data, RADAR data, and/or other sensor data; andupon detecting a salient event, the representation streaming engine 3205may steer directional audio and/or a viewport rendering of a 3Drepresentation of the environment towards the direction of the detectedevent, and the surround view system 3210 may present the directedviewport rendering on a monitor visible to occupants (e.g., driver) ofthe object 3201 (e.g., the vehicle). Additionally or alternatively, thestream transport system 3240 may stream the directional audio and/or thedirected viewport rendering to the remote experience application 3280,the stream presentation component 3282 may present the directed viewportrendering, and/or the audio component 3286 may play back the directionalaudio. In some embodiments, the directional audio and/or the directedviewport rendering may be presented in association with some alarm,presented picture-in-picture (e.g., with some other video feed such asone pointing in the direction of travel), and/or otherwise. In someembodiments, the remote controlled maneuvering component 3250 may use adetected salient event (e.g., an imminent collision) to triggercommandeering control of the object 3201 (e.g., the vehicle), avoiding acollision, and/or steering the object 3201 to safety.

In some embodiments, the stream transport system 3240 may triggergeneration and/or streaming of a representation of the environment invarious ways. In some embodiments, the stream transport system 3240triggers capture, rendering, and/or transport based on location of theobject 3201 (e.g., when a vehicle reaches a particular intersection,latitude/longitude, geofence), a detected salient event (e.g., anaccident, encountered emergency vehicle), a designated time, on demand(e.g., triggered by a command from a vehicle occupant, remote operator),and/or otherwise. The stream transport system 3240 may include one ormore command channels configured to deliver and trigger the remotecontrolled maneuvering component 3250 to execute remotely issuedcommands (e.g., via remote control functionality, such as NVRC).

The stream transport system 3240 may use any streaming technology totransport a representation(s) of the environment over the network 3260to a remote location (e.g., a remote server, the remote device 3270).For example, the stream transport system 3240 may include dedicatedcommunication channels for each type of content (e.g., one or more typesof sensor data, one or more types of rendered content, two-way audio,each type of command), may support scalable streams (e.g., usingscalable audio or video coders to adjust encoding quality based onbandwidth), and/or may implement a Quality of Service (QoS) mechanism toassign a priority to certain streamed content and/or commands and managethe stream accordingly. In some embodiments, the stream transport system3240 implements a stream hierarchy that prioritizes particular types ofcontent. For example, in some implementations that prioritize a renderedAR/VR visualization, the stream transport system 3240 may prioritize therendered AR/VR visualization for streaming and deprioritize or dropother sensor data such as LiDAR data (e.g., to conserve bandwidth). Inanother example, in some implementations that deliver content to amobile device, the stream transport system 3240 may drop LiDAR databecause the mobile device may lack the functionality to handle that typeof data. In some scenarios, the stream transport system 3240 may streamraw sensor data and drop all other content. In yet another example, thestream transport system 3240 may assign different priorities todifferent modalities of sensor data, and drop deprioritized sensor data.These are just a few examples, and other hierarchies are contemplatedwithin the scope of the present disclosure.
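
One way to picture such a stream hierarchy (a sketch under assumed content types and bitrates, not the disclosed QoS mechanism) is a greedy selection that keeps higher-priority content first and drops deprioritized content such as LiDAR when the bandwidth budget runs out:

```python
def select_streams(available, budget_kbps,
                   priority=("arvr", "video", "audio", "radar", "lidar")):
    """Walk content types in priority order and keep each stream only
    while it fits in the remaining bandwidth budget; lower-priority
    content is dropped first. 'available' maps content type -> bitrate
    in kbps (illustrative names and units)."""
    kept, remaining = [], budget_kbps
    for content in priority:
        rate = available.get(content)
        if rate is not None and rate <= remaining:
            kept.append(content)
            remaining -= rate
    return kept
```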

In an example remote user experience, a remote user operating the remote experience application 3280 may interact with one or more user interfaces (e.g., graphical user interfaces) that accept input identifying (and issue commands configuring) a particular view for rendering, use the one or more user interfaces to interact with (e.g., via two-way audio) an occupant or operator of the object 3201 (e.g., a vehicle occupant), and/or use the one or more user interfaces to issue commands steering a viewport, soundport, or sensorport (e.g., whether rendering is performed by the representation streaming engine 3205, by the remote experience application 3280, or at some other location). In an example embodiment in which the object 3201 is a vehicle, the remote experience application 3280 may be a mobile app, may control a display or monitor in a vehicle operated by the friend, and/or may control an AR or VR headset; and a vehicle occupant may use the representation streaming engine 3205 to stream to a friend or remote operator using the remote experience application 3280. In some embodiments, multiple vehicles in a fleet may stream to a location where fleet monitoring occurs.

In an autonomous or semi-autonomous driving application, the remoteexperience application 3280 may be a mobile app, and a user may operatethe remote experience application 3280 on his or her mobile device tomaneuver the object 3201 (e.g., his or her autonomous or semi-autonomousvehicle). For example, the remote experience application 3280 may beconfigured to communicate with the remote controlled maneuveringcomponent 3250 to initiate autonomous or semi-autonomous maneuvering(e.g., self-parking, summoning of the vehicle, traveling alongdesignated waypoints or a traced path, etc.), and the stream transportsystem 3240 may stream video or some other sensor data to the remoteexperience application 3280 so the user can monitor the object 3201while it is self-maneuvering. In another application, if there is afailure in autonomous maneuvering (e.g., a slow-down, stoppage in alane) or a catastrophic health event, the stream transport system 3240may connect to an emergency operator and transmit a video feed or othersensor data, enabling the emergency operator to assess the situation andpotentially take control and navigate the vehicle to safety via theremote controlled maneuvering component 3250.

In another example application, the remote experience application 3280 may render a 3D (e.g., AR or VR) representation of a physical (e.g., driving or navigational) environment with a digital twin or other virtual representation of one or more ego-objects in the physical environment. By way of non-limiting example, FIG. 33 is an example AR and/or VR system (AR/VR system) 3330, in accordance with some embodiments of the present disclosure. In this example, objects 3305a-n (e.g., which may correspond to the object 3201 of FIG. 32) are in communication with the AR/VR system 3330 (e.g., which may include or correspond to the remote device 3270 of FIG. 32) via a network 3260 (e.g., which may correspond to the network 3260 of FIG. 32). In this example, any number of objects 3305a-n (e.g., different vehicles) in the physical environment may include corresponding representation stream engines 3310a-n that sense, stream, and/or use position and/or orientation of the objects 3305a-n to update the position and/or orientation of their corresponding virtual representations (e.g., digital twins) in an AR/VR representation of the physical environment, substantially in real time. As such, the AR/VR system 3330 may render an AR/VR representation of the physical environment that includes the digital twins and/or a visual or auditory representation of sensor data (e.g., image data, LiDAR data, RADAR data, temperature, audio) streamed from the objects 3305a-n, and the AR/VR system 3330 may present the AR/VR representation of the physical environment to a user outside the physical environment to provide an immersive experience representing the physical environment through which the objects 3305a-n are traveling.

In the example illustrated in FIG. 33 , the AR/VR system 3330 includes aphysical environment simulator 3335 that generates a simulated AR/VRrepresentation of the physical environment using a digital twingenerator 3340, a 3D environment rendering component 3350, and/or anaudio component 3360. The digital twin generator 3340 may receivestreamed representations of positions and orientations of the objects3305 a-n from the objects 3305 a-n, and track, update, and managerepresentations of corresponding positions and orientations of digitaltwins representing the objects 3305 a-n in a corresponding 3Drepresentation of the physical environment. As such, the 3D environmentrendering component 3350 may render an AR and/or VR representation ofthe 3D environment that includes digital twins of the objects 3305 a-nat their streamed positions and orientations. In some embodiments, the3D environment rendering component 3350 renders an AR/VR representationof the physical environment that includes a visual representation ofstreamed sensor data (e.g., image data, LiDAR data, RADAR data,temperature) detected and streamed by one or more of the objects 3305a-n in the physical environment. Additionally or alternatively, theaudio component 3360 (e.g., which may correspond to the audio component3286 of FIG. 32 ) may playback audio (e.g., raw audio, directional orsurround audio) using streamed audio from the one or more of the objects3305 a-n. As such, the AR/VR system 3330 may render an AR or VRrepresentation of the physical environment (e.g., in a headset worn by aremote user) to provide an immersive experience representing thephysical environment through which the objects 3305 a-n are traveling.

Now referring to FIG. 34, each block of method 3400, described herein, comprises a computing process that may be performed using any combination of hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory. The method may also be embodied as computer-usable instructions stored on computer storage media. The method may be provided by a standalone application, a service or hosted service (standalone or in combination with another hosted service), or a plug-in to another product, to name a few. In addition, the method 3400 is described, by way of example, with respect to the representation streaming system 3200 of FIG. 32. However, the method may additionally or alternatively be executed by any one system, or any combination of systems, including, but not limited to, those described herein.

FIG. 34 is a flow diagram showing a method 3400 for streamingrepresentations of a physical environment, in accordance with someembodiments of the present disclosure. The method 3400, at block 3402,includes issuing a remote command to an ego-object in a physicalenvironment from a remote device in a remote location outside thephysical environment. For example, with respect to FIG. 32 , theviewport controller 3284 may control an interface of the remoteexperience application 3280 that accepts a command from a remoteoperator using the remote device 3270 designating a viewport forrendering by the surround view system 3210, and the viewport controller3284 may issue the command to the representation streaming engine 3205via the network 3260. In another example, the audio component 3286 mayissue a command steering directional or surround view audio generated bythe audio system 3220 of the representation streaming engine 3205. Inanother example, the self-parking component 3292 may control a userinterface that accepts and issues a command from a remote operator usingthe remote device 3270 instructing the remote controlled maneuveringcomponent 3250 to self-park the object 3201 (e.g., an autonomous orsemi-autonomous vehicle). In another example, the summoning component3294 may control a user interface that accepts and issues a command froma remote operator using the remote device 3270 instructing the remotecontrolled maneuvering component 3250 to summon the object 3201 (e.g.,an autonomous or semi-autonomous vehicle).

The method 3400, at block 3404, includes receiving, from the ego-object,by the remote device, a stream of representations of the physicalenvironment that were generated by the ego-object based at least on theremote command. For example, with respect to FIG. 32 , the streamtransport system 3240 may use any streaming technology to transport oneor more representations of the environment to the remote device 3270 viathe network 3260, and the stream presentation component 3282 may receivethe streamed data. The representations of the physical environmenttransported by the stream may include one or more types of data (e.g.,captured or generated by the representation streaming engine 3205, by aremote server) representing a particular time slice, and the stream mayinclude updated representations of the one or more types of data in eachsuccessive time slice.

The method 3400, at block 3406, includes causing the remote device topresent a visualization of the representations of the physicalenvironment. For example, with respect to FIG. 32 , the streampresentation component 3282 may cause the remote device 3270 to displaya representation of the environment based on the streamed data (e.g.,raw sensor data, video feed(s), audio feed(s), an AR or VR visualizationof the environment). In some embodiments, the stream presentationcomponent 3282 may present the representation of the environment inassociation with some alarm, presented picture-in-picture (e.g., withsome other video feed such as one pointing in the direction of travel),and/or otherwise. In another example, with respect to FIG. 33 , 3Denvironment rendering component 3350 may render an AR and/or VRrepresentation of the physical environment that includes a digital twinof the ego-object at a position (and orientation) included in therepresentations of the physical environment. These are just a fewexamples, and other implementations are contemplated within the scope ofthe present disclosure.
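The three blocks of method 3400 may be read as an issue-command / receive-stream / present loop. The following Python sketch illustrates that loop under assumed message formats; the command fields, time-slice contents, and function names are illustrative placeholders rather than the disclosed implementation.

# Hypothetical sketch of the issue-command / receive-stream / present loop of
# method 3400. Message fields and function names are illustrative only.
import time
from typing import Dict, Iterator

def issue_remote_command(viewport_deg: float) -> Dict:
    # Block 3402: a remote device designates a viewport for the surround
    # view system (here, simply a heading angle in degrees).
    return {"type": "set_viewport", "heading_deg": viewport_deg, "issued_at": time.time()}

def stream_time_slices(command: Dict, num_slices: int = 3) -> Iterator[Dict]:
    # Block 3404: the ego-object responds with per-time-slice representations
    # of the environment generated based at least on the remote command.
    for i in range(num_slices):
        yield {
            "slice_index": i,
            "viewport_heading_deg": command["heading_deg"],
            "video_frame": f"<encoded frame for slice {i}>",   # placeholder payload
            "ego_pose": {"x": 1.0 * i, "y": 0.0, "yaw_deg": 0.0},
        }

def present(slice_data: Dict) -> None:
    # Block 3406: the remote device visualizes each received representation.
    print(f"slice {slice_data['slice_index']}: ego at {slice_data['ego_pose']}")

command = issue_remote_command(viewport_deg=90.0)
for time_slice in stream_time_slices(command):
    present(time_slice)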

The systems and methods described herein may be used by, withoutlimitation, non-autonomous vehicles, semi-autonomous vehicles (e.g., inone or more advanced driver assistance systems (ADAS)), piloted andun-piloted robots or robotic platforms, warehouse vehicles, off-roadvehicles, vehicles coupled to one or more trailers, flying vessels,boats, shuttles, emergency response vehicles, motorcycles, electric ormotorized bicycles, aircraft, construction vehicles, underwater craft,drones, and/or other vehicle types. Further, the systems and methodsdescribed herein may be used for a variety of purposes, by way ofexample and without limitation, for machine control, machine locomotion,machine driving, synthetic data generation, model training, perception,augmented reality, virtual reality, mixed reality, robotics, securityand surveillance, simulation and digital twinning, autonomous orsemi-autonomous machine applications, deep learning, environmentsimulation, object or actor simulation and/or digital twinning, datacenter processing, conversational AI, light transport simulation (e.g.,ray-tracing, path tracing, etc.), collaborative content creation for 3Dassets, cloud computing and/or any other suitable applications.

Disclosed embodiments may be comprised in a variety of different systems such as automotive systems (e.g., a control system for an autonomous or semi-autonomous machine, a perception system for an autonomous or semi-autonomous machine), systems implemented using a robot, aerial systems, medical systems, boating systems, smart area monitoring systems, systems for performing deep learning operations, systems for performing simulation operations, systems for performing digital twin operations, systems implemented using an edge device, systems incorporating one or more virtual machines (VMs), systems for performing synthetic data generation operations, systems implemented at least partially in a data center, systems for performing conversational AI operations, systems for performing light transport simulation, systems for performing collaborative content creation for 3D assets, systems implemented at least partially using cloud computing resources, and/or other types of systems.

Example Autonomous Vehicle

FIG. 35A is an illustration of an example autonomous vehicle 3500, inaccordance with some embodiments of the present disclosure. Theautonomous vehicle 3500 (alternatively referred to herein as the“vehicle 3500”) may include, without limitation, a passenger vehicle,such as a car, a truck, a bus, a first responder vehicle, a shuttle, anelectric or motorized bicycle, a motorcycle, a fire truck, a policevehicle, an ambulance, a boat, a construction vehicle, an underwatercraft, a robotic vehicle, a drone, an airplane, a vehicle coupled to atrailer (e.g., a semi-tractor-trailer truck used for hauling cargo),and/or another type of vehicle (e.g., that is unmanned and/or thataccommodates one or more passengers). Autonomous vehicles are generallydescribed in terms of automation levels, defined by the National HighwayTraffic Safety Administration (NHTSA), a division of the US Departmentof Transportation, and the Society of Automotive Engineers (SAE)“Taxonomy and Definitions for Terms Related to Driving AutomationSystems for On-Road Motor Vehicles” (Standard No. J3016-201806,published on Jun. 15, 2018, Standard No. J3016-201609, published on Sep.30, 2016, and previous and future versions of this standard). Thevehicle 3500 may be capable of functionality in accordance with one ormore of Level 3-Level 5 of the autonomous driving levels. The vehicle3500 may be capable of functionality in accordance with one or more ofLevel 1-Level 5 of the autonomous driving levels. For example, thevehicle 3500 may be capable of driver assistance (Level 1), partialautomation (Level 2), conditional automation (Level 3), high automation(Level 4), and/or full automation (Level 5), depending on theembodiment. The term “autonomous,” as used herein, may include anyand/or all types of autonomy for the vehicle 3500 or other machine, suchas being fully autonomous, being highly autonomous, being conditionallyautonomous, being partially autonomous, providing assistive autonomy,being semi-autonomous, being primarily autonomous, or other designation.

The vehicle 3500 may include components such as a chassis, a vehiclebody, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and othercomponents of a vehicle. The vehicle 3500 may include a propulsionsystem 3550, such as an internal combustion engine, hybrid electricpower plant, an all-electric engine, and/or another propulsion systemtype. The propulsion system 3550 may be connected to a drive train ofthe vehicle 3500, which may include a transmission, to enable thepropulsion of the vehicle 3500. The propulsion system 3550 may becontrolled in response to receiving signals from thethrottle/accelerator 3552.

A steering system 3554, which may include a steering wheel, may be usedto steer the vehicle 3500 (e.g., along a desired path or route) when thepropulsion system 3550 is operating (e.g., when the vehicle is inmotion). The steering system 3554 may receive signals from a steeringactuator 3556. The steering wheel may be optional for full automation(Level 5) functionality.

The brake sensor system 3546 may be used to operate the vehicle brakesin response to receiving signals from the brake actuators 3548 and/orbrake sensors.

Controller(s) 3536, which may include one or more system on chips (SoCs) 3504 (FIG. 35C) and/or GPU(s), may provide signals (e.g., representative of commands) to one or more components and/or systems of the vehicle 3500. For example, the controller(s) may send signals to operate the vehicle brakes via one or more brake actuators 3548, to operate the steering system 3554 via one or more steering actuators 3556, and to operate the propulsion system 3550 via one or more throttle/accelerators 3552. The controller(s) 3536 may include one or more onboard (e.g., integrated) computing devices (e.g., supercomputers) that process sensor signals and output operation commands (e.g., signals representing commands) to enable autonomous driving and/or to assist a human driver in driving the vehicle 3500. The controller(s) 3536 may include a first controller 3536 for autonomous driving functions, a second controller 3536 for functional safety functions, a third controller 3536 for artificial intelligence functionality (e.g., computer vision), a fourth controller 3536 for infotainment functionality, a fifth controller 3536 for redundancy in emergency conditions, and/or other controllers. In some examples, a single controller 3536 may handle two or more of the above functionalities, two or more controllers 3536 may handle a single functionality, and/or any combination thereof.
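As a rough illustration of this controller arrangement, the Python sketch below maps functionalities to controllers and emits placeholder actuator commands; the names and signal formats are assumptions for the example only.

# Illustrative sketch (not the actual controller firmware) of how multiple
# controllers 3536 might be assigned functionalities and emit actuator
# commands. Names and signal formats are assumptions.
from typing import Dict

class Controller:
    def __init__(self, name: str) -> None:
        self.name = name

    def send_signal(self, target: str, command: Dict) -> None:
        # In the vehicle this would be a bus message to an actuator;
        # here we only print it.
        print(f"[{self.name}] -> {target}: {command}")

# One controller per functionality (a single controller could also handle
# several of these, or several controllers could share one functionality).
controllers = {
    "autonomous_driving": Controller("controller-1"),
    "functional_safety": Controller("controller-2"),
    "ai": Controller("controller-3"),
    "infotainment": Controller("controller-4"),
}

# Example: the autonomous-driving controller commands brake, steering, and throttle actuators.
controllers["autonomous_driving"].send_signal("brake_actuator_3548", {"pressure_pct": 40})
controllers["autonomous_driving"].send_signal("steering_actuator_3556", {"angle_deg": -5.0})
controllers["autonomous_driving"].send_signal("throttle_accelerator_3552", {"throttle_pct": 0})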

The controller(s) 3536 may provide the signals for controlling one ormore components and/or systems of the vehicle 3500 in response to sensordata received from one or more sensors (e.g., sensor inputs). The sensordata may be received from, for example and without limitation, globalnavigation satellite systems (“GNSS”) sensor(s) 3558 (e.g., GlobalPositioning System sensor(s)), RADAR sensor(s) 3560, ultrasonicsensor(s) 3562, LIDAR sensor(s) 3564, inertial measurement unit (IMU)sensor(s) 3566 (e.g., accelerometer(s), gyroscope(s), magneticcompass(es), magnetometer(s), etc.), microphone(s) 3596, stereocamera(s) 3568, wide-view camera(s) 3570 (e.g., fisheye cameras),infrared camera(s) 3572, surround camera(s) 3574 (e.g., 360° cameras),long-range and/or mid-range camera(s) 3598, speed sensor(s) 3544 (e.g.,for measuring the speed of the vehicle 3500), vibration sensor(s) 3542,steering sensor(s) 3540, brake sensor(s) (e.g., as part of the brakesensor system 3546), and/or other sensor types.

One or more of the controller(s) 3536 may receive inputs (e.g.,represented by input data) from an instrument cluster 3532 of thevehicle 3500 and provide outputs (e.g., represented by output data,display data, etc.) via a human-machine interface (HMI) display 3534, anaudible annunciator, a loudspeaker, and/or via other components of thevehicle 3500. The outputs may include information such as vehiclevelocity, speed, time, map data (e.g., the High Definition (“HD”) map3522 of FIG. 35C), location data (e.g., the vehicle's 3500 location,such as on a map), direction, location of other vehicles (e.g., anoccupancy grid), information about objects and status of objects asperceived by the controller(s) 3536, etc. For example, the HMI display3534 may display information about the presence of one or more objects(e.g., a street sign, caution sign, traffic light changing, etc.),and/or information about driving maneuvers the vehicle has made, ismaking, or will make (e.g., changing lanes now, taking exit 34B in twomiles, etc.).

The vehicle 3500 further includes a network interface 3524 which may useone or more wireless antenna(s) 3526 and/or modem(s) to communicate overone or more networks. For example, the network interface 3524 may becapable of communication over Long-Term Evolution (“LTE”), Wideband CodeDivision Multiple Access (“WCDMA”), Universal Mobile TelecommunicationsSystem (“UMTS”), Global System for Mobile communication (“GSM”),IMT-CDMA Multi-Carrier (“CDMA2000”), etc. The wireless antenna(s) 3526may also enable communication between objects in the environment (e.g.,vehicles, mobile devices, etc.), using local area network(s), such asBluetooth, Bluetooth Low Energy (“LE”), Z-Wave, ZigBee, etc., and/or lowpower wide-area network(s) (“LPWANs”), such as LoRaWAN, SigFox, etc.

FIG. 35B is an example of camera locations and fields of view for theexample autonomous vehicle 3500 of FIG. 35A, in accordance with someembodiments of the present disclosure. The cameras and respective fieldsof view are one example embodiment and are not intended to be limiting.For example, additional and/or alternative cameras may be includedand/or the cameras may be located at different locations on the vehicle3500.

The camera types for the cameras may include, but are not limited to, digital cameras that may be adapted for use with the components and/or systems of the vehicle 3500. The camera(s) may operate at automotive safety integrity level (ASIL) B and/or at another ASIL. The camera types may be capable of any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc., depending on the embodiment. The cameras may be capable of using rolling shutters, global shutters, another type of shutter, or a combination thereof. In some examples, the color filter array may include a red clear clear clear (RCCC) color filter array, a red clear clear blue (RCCB) color filter array, a red blue green clear (RBGC) color filter array, a Foveon X3 color filter array, a Bayer sensor (RGGB) color filter array, a monochrome sensor color filter array, and/or another type of color filter array. In some embodiments, clear pixel cameras, such as cameras with an RCCC, an RCCB, and/or an RBGC color filter array, may be used in an effort to increase light sensitivity.

In some examples, one or more of the camera(s) may be used to performadvanced driver assistance systems (ADAS) functions (e.g., as part of aredundant or fail-safe design). For example, a Multi-Function MonoCamera may be installed to provide functions including lane departurewarning, traffic sign assist and intelligent headlamp control. One ormore of the camera(s) (e.g., all of the cameras) may record and provideimage data (e.g., video) simultaneously.

One or more of the cameras may be mounted in a mounting assembly, suchas a custom designed (three dimensional (“3D”) printed) assembly, inorder to cut out stray light and reflections from within the car (e.g.,reflections from the dashboard reflected in the windshield mirrors)which may interfere with the camera's image data capture abilities. Withreference to wing-mirror mounting assemblies, the wing-mirror assembliesmay be custom 3D printed so that the camera mounting plate matches theshape of the wing-mirror. In some examples, the camera(s) may beintegrated into the wing-mirror. For side-view cameras, the camera(s)may also be integrated within the four pillars at each corner of thecabin.

Cameras with a field of view that includes portions of the environment in front of the vehicle 3500 (e.g., front-facing cameras) may be used for surround view, to help identify forward-facing paths and obstacles, as well as to aid in, with the help of one or more controllers 3536 and/or control SoCs, providing information critical to generating an occupancy grid and/or determining the preferred vehicle paths. Front-facing cameras may be used to perform many of the same ADAS functions as LIDAR, including emergency braking, pedestrian detection, and collision avoidance. Front-facing cameras may also be used for ADAS functions and systems including Lane Departure Warnings ("LDW"), Autonomous Cruise Control ("ACC"), and/or other functions such as traffic sign recognition.

A variety of cameras may be used in a front-facing configuration,including, for example, a monocular camera platform that includes acomplementary metal oxide semiconductor (“CMOS”) color imager. Anotherexample may be a wide-view camera(s) 3570 that may be used to perceiveobjects coming into view from the periphery (e.g., pedestrians, crossingtraffic or bicycles). Although only one wide-view camera is illustratedin FIG. 35B, there may be any number (including zero) of wide-viewcameras 3570 on the vehicle 3500. In addition, any number of long-rangecamera(s) 3598 (e.g., a long-view stereo camera pair) may be used fordepth-based object detection, especially for objects for which a neuralnetwork has not yet been trained. The long-range camera(s) 3598 may alsobe used for object detection and classification, as well as basic objecttracking.

Any number of stereo cameras 3568 may also be included in a front-facing configuration. In at least one embodiment, one or more of the stereo camera(s) 3568 may include an integrated control unit comprising a scalable processing unit, which may provide programmable logic (e.g., a field-programmable gate array (FPGA)) and a multi-core microprocessor with an integrated Controller Area Network ("CAN") or Ethernet interface on a single chip. Such a unit may be used to generate a 3D map of the vehicle's environment, including a distance estimate for all the points in the image. An alternative stereo camera(s) 3568 may include a compact stereo vision sensor(s) that may include two camera lenses (one each on the left and right) and an image processing chip that may measure the distance from the vehicle to the target object and use the generated information (e.g., metadata) to activate the autonomous emergency braking and lane departure warning functions. Other types of stereo camera(s) 3568 may be used in addition to, or alternatively from, those described herein.

Cameras with a field of view that includes portions of the environment to the side of the vehicle 3500 (e.g., side-view cameras) may be used for surround view, providing information used to create and update the occupancy grid, as well as to generate side impact collision warnings. For example, surround camera(s) 3574 (e.g., four surround cameras 3574 as illustrated in FIG. 35B) may be positioned on the vehicle 3500. The surround camera(s) 3574 may include wide-view camera(s) 3570, fisheye camera(s), 360° camera(s), and/or the like. For example, four fisheye cameras may be positioned on the vehicle's front, rear, and sides. In an alternative arrangement, the vehicle may use three surround camera(s) 3574 (e.g., left, right, and rear), and may leverage one or more other camera(s) (e.g., a forward-facing camera) as a fourth surround view camera.

Cameras with a field of view that include portions of the environment tothe rear of the vehicle 3500 (e.g., rear-view cameras) may be used forpark assistance, surround view, rear collision warnings, and creatingand updating the occupancy grid. A wide variety of cameras may be usedincluding, but not limited to, cameras that are also suitable as afront-facing camera(s) (e.g., long-range and/or mid-range camera(s)3598, stereo camera(s) 3568), infrared camera(s) 3572, etc.), asdescribed herein.

FIG. 35C is a block diagram of an example system architecture for theexample autonomous vehicle 3500 of FIG. 35A, in accordance with someembodiments of the present disclosure. It should be understood that thisand other arrangements described herein are set forth only as examples.Other arrangements and elements (e.g., machines, interfaces, functions,orders, groupings of functions, etc.) may be used in addition to orinstead of those shown, and some elements may be omitted altogether.Further, many of the elements described herein are functional entitiesthat may be implemented as discrete or distributed components or inconjunction with other components, and in any suitable combination andlocation. Various functions described herein as being performed byentities may be carried out by hardware, firmware, and/or software. Forinstance, various functions may be carried out by a processor executinginstructions stored in memory.

Each of the components, features, and systems of the vehicle 3500 inFIG. 35C are illustrated as being connected via bus 3502. The bus 3502may include a Controller Area Network (CAN) data interface(alternatively referred to herein as a “CAN bus”). A CAN may be anetwork inside the vehicle 3500 used to aid in control of variousfeatures and functionality of the vehicle 3500, such as actuation ofbrakes, acceleration, braking, steering, windshield wipers, etc. A CANbus may be configured to have dozens or even hundreds of nodes, eachwith its own unique identifier (e.g., a CAN ID). The CAN bus may be readto find steering wheel angle, ground speed, engine revolutions perminute (RPMs), button positions, and/or other vehicle status indicators.The CAN bus may be ASIL B compliant.

Although the bus 3502 is described herein as being a CAN bus, this is not intended to be limiting. For example, in addition to, or alternatively from, the CAN bus, FlexRay and/or Ethernet may be used. Additionally, although a single line is used to represent the bus 3502, this is not intended to be limiting. For example, there may be any number of busses 3502, which may include one or more CAN busses, one or more FlexRay busses, one or more Ethernet busses, and/or one or more other types of busses using a different protocol. In some examples, two or more busses 3502 may be used to perform different functions, and/or may be used for redundancy. For example, a first bus 3502 may be used for collision avoidance functionality and a second bus 3502 may be used for actuation control. In any example, each bus 3502 may communicate with any of the components of the vehicle 3500, and two or more busses 3502 may communicate with the same components. In some examples, each SoC 3504, each controller 3536, and/or each computer within the vehicle may have access to the same input data (e.g., inputs from sensors of the vehicle 3500), and may be connected to a common bus, such as the CAN bus.
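As a concrete, hypothetical illustration of reading vehicle status indicators off a CAN bus, the sketch below decodes steering wheel angle, ground speed, and engine RPM from individual frames; the CAN IDs, byte layouts, and scale factors are invented for the example (real vehicles use manufacturer-specific signal databases).

# Hypothetical sketch of reading vehicle status off a CAN bus. The CAN IDs,
# byte layouts, and scale factors below are invented for illustration.
import struct
from typing import Dict, Tuple

# (CAN ID) -> (signal name, struct format, scale, offset) -- all assumed values.
SIGNAL_TABLE: Dict[int, Tuple[str, str, float, float]] = {
    0x0C6: ("steering_wheel_angle_deg", ">h", 0.1, 0.0),   # signed 16-bit, 0.1 deg/bit
    0x1A0: ("ground_speed_kph",         ">H", 0.01, 0.0),  # unsigned 16-bit
    0x3D0: ("engine_rpm",               ">H", 0.25, 0.0),
}

def decode_frame(can_id: int, payload: bytes) -> Dict[str, float]:
    """Decode one CAN frame into a named physical value, if the ID is known."""
    if can_id not in SIGNAL_TABLE:
        return {}
    name, fmt, scale, offset = SIGNAL_TABLE[can_id]
    (raw,) = struct.unpack_from(fmt, payload)
    return {name: raw * scale + offset}

# Example frames (payloads are fabricated for the sketch).
print(decode_frame(0x0C6, struct.pack(">h", -153)))   # steering angle of about -15.3 deg
print(decode_frame(0x3D0, struct.pack(">H", 3200)))   # engine speed of 800.0 RPM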

The vehicle 3500 may include one or more controller(s) 3536, such asthose described herein with respect to FIG. 35A. The controller(s) 3536may be used for a variety of functions. The controller(s) 3536 may becoupled to any of the various other components and systems of thevehicle 3500, and may be used for control of the vehicle 3500,artificial intelligence of the vehicle 3500, infotainment for thevehicle 3500, and/or the like.

The vehicle 3500 may include a system(s) on a chip (SoC) 3504. The SoC3504 may include CPU(s) 3506, GPU(s) 3508, processor(s) 3510, cache(s)3512, accelerator(s) 3514, data store(s) 3516, and/or other componentsand features not illustrated. The SoC(s) 3504 may be used to control thevehicle 3500 in a variety of platforms and systems. For example, theSoC(s) 3504 may be combined in a system (e.g., the system of the vehicle3500) with an HD map 3522 which may obtain map refreshes and/or updatesvia a network interface 3524 from one or more servers (e.g., server(s)3578 of FIG. 35D).

The CPU(s) 3506 may include a CPU cluster or CPU complex (alternativelyreferred to herein as a “CCPLEX”). The CPU(s) 3506 may include multiplecores and/or L2 caches. For example, in some embodiments, the CPU(s)3506 may include eight cores in a coherent multi-processorconfiguration. In some embodiments, the CPU(s) 3506 may include fourdual-core clusters where each cluster has a dedicated L2 cache (e.g., a2 MB L2 cache). The CPU(s) 3506 (e.g., the CCPLEX) may be configured tosupport simultaneous cluster operation enabling any combination of theclusters of the CPU(s) 3506 to be active at any given time.

The CPU(s) 3506 may implement power management capabilities that includeone or more of the following features: individual hardware blocks may beclock-gated automatically when idle to save dynamic power; each coreclock may be gated when the core is not actively executing instructionsdue to execution of WFI/WFE instructions; each core may be independentlypower-gated; each core cluster may be independently clock-gated when allcores are clock-gated or power-gated; and/or each core cluster may beindependently power-gated when all cores are power-gated. The CPU(s)3506 may further implement an enhanced algorithm for managing powerstates, where allowed power states and expected wakeup times arespecified, and the hardware/microcode determines the best power state toenter for the core, cluster, and CCPLEX. The processing cores maysupport simplified power state entry sequences in software with the workoffloaded to microcode.

The GPU(s) 3508 may include an integrated GPU (alternatively referred toherein as an “iGPU”). The GPU(s) 3508 may be programmable and may beefficient for parallel workloads. The GPU(s) 3508, in some examples, mayuse an enhanced tensor instruction set. The GPU(s) 3508 may include oneor more streaming microprocessors, where each streaming microprocessormay include an L1 cache (e.g., an L1 cache with at least 96 KB storagecapacity), and two or more of the streaming microprocessors may share anL2 cache (e.g., an L2 cache with a 512 KB storage capacity). In someembodiments, the GPU(s) 3508 may include at least eight streamingmicroprocessors. The GPU(s) 3508 may use compute application programminginterface(s) (API(s)). In addition, the GPU(s) 3508 may use one or moreparallel computing platforms and/or programming models (e.g., NVIDIA'sCUDA).

The GPU(s) 3508 may be power-optimized for best performance in automotive and embedded use cases. For example, the GPU(s) 3508 may be fabricated on a Fin field-effect transistor (FinFET). However, this is not intended to be limiting and the GPU(s) 3508 may be fabricated using other semiconductor manufacturing processes. Each streaming microprocessor may incorporate a number of mixed-precision processing cores partitioned into multiple blocks. For example, and without limitation, 64 FP32 cores and 32 FP64 cores may be partitioned into four processing blocks. In such an example, each processing block may be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA TENSOR COREs for deep learning matrix arithmetic, an L0 instruction cache, a warp scheduler, a dispatch unit, and/or a 64 KB register file. In addition, the streaming microprocessors may include independent parallel integer and floating-point data paths to provide for efficient execution of workloads with a mix of computation and addressing calculations. The streaming microprocessors may include independent thread scheduling capability to enable finer-grain synchronization and cooperation between parallel threads. The streaming microprocessors may include a combined L1 data cache and shared memory unit in order to improve performance while simplifying programming.

The GPU(s) 3508 may include a high bandwidth memory (HBM) and/or a 16 GBHBM2 memory subsystem to provide, in some examples, about 900 GB/secondpeak memory bandwidth. In some examples, in addition to, oralternatively from, the HBM memory, a synchronous graphics random-accessmemory (SGRAM) may be used, such as a graphics double data rate typefive synchronous random-access memory (GDDR5).

The GPU(s) 3508 may include unified memory technology including access counters to allow for more accurate migration of memory pages to the processor that accesses them most frequently, thereby improving efficiency for memory ranges shared between processors. In some examples, address translation services (ATS) support may be used to allow the GPU(s) 3508 to access the CPU(s) 3506 page tables directly. In such examples, when the GPU(s) 3508 memory management unit (MMU) experiences a miss, an address translation request may be transmitted to the CPU(s) 3506. In response, the CPU(s) 3506 may look in its page tables for the virtual-to-physical mapping for the address and transmit the translation back to the GPU(s) 3508. As such, unified memory technology may allow a single unified virtual address space for memory of both the CPU(s) 3506 and the GPU(s) 3508, thereby simplifying the GPU(s) 3508 programming and porting of applications to the GPU(s) 3508.

In addition, the GPU(s) 3508 may include an access counter that may keeptrack of the frequency of access of the GPU(s) 3508 to memory of otherprocessors. The access counter may help ensure that memory pages aremoved to the physical memory of the processor that is accessing thepages most frequently.
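The access-counter behavior can be illustrated with a small software analogy: count remote accesses per page and migrate the page once a remote processor clearly dominates. The sketch below is a toy simulation with invented thresholds and structures, not the hardware mechanism.

# Toy simulation of access-counter-driven page migration (a software analogy
# of the behavior described above; thresholds and structures are invented).
from collections import defaultdict
from typing import Dict

MIGRATION_THRESHOLD = 8  # assumed: remote accesses before a page is migrated

page_owner: Dict[int, str] = {0x1000: "cpu", 0x2000: "cpu"}
remote_access_count = defaultdict(int)  # (page, accessor) -> count

def access(page: int, accessor: str) -> None:
    """Record an access and migrate the page once a remote processor dominates."""
    if page_owner[page] != accessor:
        remote_access_count[(page, accessor)] += 1
        if remote_access_count[(page, accessor)] >= MIGRATION_THRESHOLD:
            page_owner[page] = accessor                  # move page to frequent accessor
            remote_access_count[(page, accessor)] = 0
            print(f"migrated page {hex(page)} to {accessor}")

for _ in range(8):
    access(0x1000, "gpu")   # the GPU touches this page repeatedly
print(page_owner)           # page 0x1000 now lives in GPU-local memory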

The SoC(s) 3504 may include any number of cache(s) 3512, including those described herein. For example, the cache(s) 3512 may include an L3 cache that is available to both the CPU(s) 3506 and the GPU(s) 3508 (e.g., that is connected to both the CPU(s) 3506 and the GPU(s) 3508). The cache(s) 3512 may include a write-back cache that may keep track of states of lines, such as by using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). The L3 cache may include 4 MB or more, depending on the embodiment, although smaller cache sizes may be used.

The SoC(s) 3504 may include an arithmetic logic unit(s) (ALU(s)) which may be leveraged in performing processing with respect to any of the variety of tasks or operations of the vehicle 3500—such as processing DNNs. In addition, the SoC(s) 3504 may include a floating point unit(s) (FPU(s))—or other math coprocessor or numeric coprocessor types—for performing mathematical operations within the system. For example, the SoC(s) 3504 may include one or more FPUs integrated as execution units within a CPU(s) 3506 and/or GPU(s) 3508.

The SoC(s) 3504 may include one or more accelerators 3514 (e.g.,hardware accelerators, software accelerators, or a combination thereof).For example, the SoC(s) 3504 may include a hardware acceleration clusterthat may include optimized hardware accelerators and/or large on-chipmemory. The large on-chip memory (e.g., 4 MB of SRAM), may enable thehardware acceleration cluster to accelerate neural networks and othercalculations. The hardware acceleration cluster may be used tocomplement the GPU(s) 3508 and to off-load some of the tasks of theGPU(s) 3508 (e.g., to free up more cycles of the GPU(s) 3508 forperforming other tasks). As an example, the accelerator(s) 3514 may beused for targeted workloads (e.g., perception, convolutional neuralnetworks (CNNs), etc.) that are stable enough to be amenable toacceleration. The term “CNN,” as used herein, may include all types ofCNNs, including region-based or regional convolutional neural networks(RCNNs) and Fast RCNNs (e.g., as used for object detection).

The accelerator(s) 3514 (e.g., the hardware acceleration cluster) mayinclude a deep learning accelerator(s) (DLA). The DLA(s) may include oneor more Tensor processing units (TPUs) that may be configured to providean additional ten trillion operations per second for deep learningapplications and inferencing. The TPUs may be accelerators configuredto, and optimized for, performing image processing functions (e.g., forCNNs, RCNNs, etc.). The DLA(s) may further be optimized for a specificset of neural network types and floating point operations, as well asinferencing. The design of the DLA(s) may provide more performance permillimeter than a general-purpose GPU, and vastly exceeds theperformance of a CPU. The TPU(s) may perform several functions,including a single-instance convolution function, supporting, forexample, INT8, INT16, and FP16 data types for both features and weights,as well as post-processor functions.

The DLA(s) may quickly and efficiently execute neural networks, especially CNNs, on processed or unprocessed data for any of a variety of functions, including, for example and without limitation: a CNN for object identification and detection using data from camera sensors; a CNN for distance estimation using data from camera sensors; a CNN for emergency vehicle detection and identification using data from microphones; a CNN for facial recognition and vehicle owner identification using data from camera sensors; and/or a CNN for security and/or safety related events.

The DLA(s) may perform any function of the GPU(s) 3508, and by using aninference accelerator, for example, a designer may target either theDLA(s) or the GPU(s) 3508 for any function. For example, the designermay focus processing of CNNs and floating point operations on the DLA(s)and leave other functions to the GPU(s) 3508 and/or other accelerator(s)3514.

The accelerator(s) 3514 (e.g., the hardware acceleration cluster) mayinclude a programmable vision accelerator(s) (PVA), which mayalternatively be referred to herein as a computer vision accelerator.The PVA(s) may be designed and configured to accelerate computer visionalgorithms for the advanced driver assistance systems (ADAS), autonomousdriving, and/or augmented reality (AR) and/or virtual reality (VR)applications. The PVA(s) may provide a balance between performance andflexibility. For example, each PVA(s) may include, for example andwithout limitation, any number of reduced instruction set computer(RISC) cores, direct memory access (DMA), and/or any number of vectorprocessors.

The RISC cores may interact with image sensors (e.g., the image sensorsof any of the cameras described herein), image signal processor(s),and/or the like. Each of the RISC cores may include any amount ofmemory. The RISC cores may use any of a number of protocols, dependingon the embodiment. In some examples, the RISC cores may execute areal-time operating system (RTOS). The RISC cores may be implementedusing one or more integrated circuit devices, application specificintegrated circuits (ASICs), and/or memory devices. For example, theRISC cores may include an instruction cache and/or a tightly coupledRAM.

The DMA may enable components of the PVA(s) to access the system memoryindependently of the CPU(s) 3506. The DMA may support any number offeatures used to provide optimization to the PVA including, but notlimited to, supporting multi-dimensional addressing and/or circularaddressing. In some examples, the DMA may support up to six or moredimensions of addressing, which may include block width, block height,block depth, horizontal block stepping, vertical block stepping, and/ordepth stepping.

The vector processors may be programmable processors that may bedesigned to efficiently and flexibly execute programming for computervision algorithms and provide signal processing capabilities. In someexamples, the PVA may include a PVA core and two vector processingsubsystem partitions. The PVA core may include a processor subsystem,DMA engine(s) (e.g., two DMA engines), and/or other peripherals. Thevector processing subsystem may operate as the primary processing engineof the PVA, and may include a vector processing unit (VPU), aninstruction cache, and/or vector memory (e.g., VMEM). A VPU core mayinclude a digital signal processor such as, for example, a singleinstruction, multiple data (SIMD), very long instruction word (VLIW)digital signal processor. The combination of the SIMD and VLIW mayenhance throughput and speed.

Each of the vector processors may include an instruction cache and may be coupled to dedicated memory. As a result, in some examples, each of the vector processors may be configured to execute independently of the other vector processors. In other examples, the vector processors that are included in a particular PVA may be configured to employ data parallelism. For example, in some embodiments, the plurality of vector processors included in a single PVA may execute the same computer vision algorithm, but on different regions of an image. In other examples, the vector processors included in a particular PVA may simultaneously execute different computer vision algorithms, on the same image, or even execute different algorithms on sequential images or portions of an image. Among other things, any number of PVAs may be included in the hardware acceleration cluster and any number of vector processors may be included in each of the PVAs. In addition, the PVA(s) may include additional error correcting code (ECC) memory, to enhance overall system safety.

The accelerator(s) 3514 (e.g., the hardware acceleration cluster) mayinclude a computer vision network on-chip and SRAM, for providing ahigh-bandwidth, low latency SRAM for the accelerator(s) 3514. In someexamples, the on-chip memory may include at least 4 MB SRAM, consistingof, for example and without limitation, eight field-configurable memoryblocks, that may be accessible by both the PVA and the DLA. Each pair ofmemory blocks may include an advanced peripheral bus (APB) interface,configuration circuitry, a controller, and a multiplexer. Any type ofmemory may be used. The PVA and DLA may access the memory via a backbonethat provides the PVA and DLA with high-speed access to memory. Thebackbone may include a computer vision network on-chip thatinterconnects the PVA and the DLA to the memory (e.g., using the APB).

The computer vision network on-chip may include an interface thatdetermines, before transmission of any control signal/address/data, thatboth the PVA and the DLA provide ready and valid signals. Such aninterface may provide for separate phases and separate channels fortransmitting control signals/addresses/data, as well as burst-typecommunications for continuous data transfer. This type of interface maycomply with ISO 26262 or IEC 61508 standards, although other standardsand protocols may be used.

In some examples, the SoC(s) 3504 may include a real-time ray-tracinghardware accelerator, such as described in U.S. patent application Ser.No. 16/101,232, filed on Aug. 10, 2018. The real-time ray-tracinghardware accelerator may be used to quickly and efficiently determinethe positions and extents of objects (e.g., within a world model), togenerate real-time visualization simulations, for RADAR signalinterpretation, for sound propagation synthesis and/or analysis, forsimulation of SONAR systems, for general wave propagation simulation,for comparison to LIDAR data for purposes of localization and/or otherfunctions, and/or for other uses. In some embodiments, one or more treetraversal units (TTUs) may be used for executing one or more ray-tracingrelated operations.

The accelerator(s) 3514 (e.g., the hardware accelerator cluster) have awide array of uses for autonomous driving. The PVA may be a programmablevision accelerator that may be used for key processing stages in ADASand autonomous vehicles. The PVA's capabilities are a good match foralgorithmic domains needing predictable processing, at low power and lowlatency. In other words, the PVA performs well on semi-dense or denseregular computation, even on small data sets, which need predictablerun-times with low latency and low power. Thus, in the context ofplatforms for autonomous vehicles, the PVAs are designed to run classiccomputer vision algorithms, as they are efficient at object detectionand operating on integer math.

For example, according to one embodiment of the technology, the PVA isused to perform computer stereo vision. A semi-global matching-basedalgorithm may be used in some examples, although this is not intended tobe limiting. Many applications for Level 3-5 autonomous driving requiremotion estimation/stereo matching on-the-fly (e.g., structure frommotion, pedestrian recognition, lane detection, etc.). The PVA mayperform computer stereo vision function on inputs from two monocularcameras.

In some examples, the PVA may be used to perform dense optical flow. For example, the PVA may be used to process raw RADAR data (e.g., using a 4D Fast Fourier Transform) to provide processed RADAR data. In other examples, the PVA is used for time-of-flight depth processing, by processing raw time-of-flight data to provide processed time-of-flight data, for example.

The DLA may be used to run any type of network to enhance control anddriving safety, including for example, a neural network that outputs ameasure of confidence for each object detection. Such a confidence valuemay be interpreted as a probability, or as providing a relative “weight”of each detection compared to other detections. This confidence valueenables the system to make further decisions regarding which detectionsshould be considered as true positive detections rather than falsepositive detections. For example, the system may set a threshold valuefor the confidence and consider only the detections exceeding thethreshold value as true positive detections. In an automatic emergencybraking (AEB) system, false positive detections would cause the vehicleto automatically perform emergency braking, which is obviouslyundesirable. Therefore, only the most confident detections should beconsidered as triggers for AEB. The DLA may run a neural network forregressing the confidence value. The neural network may take as itsinput at least some subset of parameters, such as bounding boxdimensions, ground plane estimate obtained (e.g. from anothersubsystem), inertial measurement unit (IMU) sensor 3566 output thatcorrelates with the vehicle 3500 orientation, distance, 3D locationestimates of the object obtained from the neural network and/or othersensors (e.g., LIDAR sensor(s) 3564 or RADAR sensor(s) 3560), amongothers.
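A minimal sketch of the confidence-gating idea follows: only detections whose regressed confidence exceeds a threshold are passed on as potential AEB triggers. The detection format, threshold value, and range limit are assumptions for illustration.

# Minimal sketch of confidence-gated detections for an AEB-style consumer.
# The detection format, threshold, and range limit are assumed values.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str
    confidence: float   # e.g., regressed by a DLA-hosted neural network
    distance_m: float

AEB_CONFIDENCE_THRESHOLD = 0.9   # assumed: only very confident detections may trigger AEB

def aeb_triggers(detections: List[Detection], max_range_m: float = 30.0) -> List[Detection]:
    """Keep only detections confident enough (and close enough) to be treated
    as true positives for automatic emergency braking."""
    return [d for d in detections
            if d.confidence >= AEB_CONFIDENCE_THRESHOLD and d.distance_m <= max_range_m]

detections = [
    Detection("pedestrian", 0.97, 12.0),   # allowed to trigger AEB
    Detection("pedestrian", 0.55, 9.0),    # filtered out as a likely false positive
    Detection("vehicle", 0.95, 80.0),      # confident but outside the braking range
]
print(aeb_triggers(detections))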

The SoC(s) 3504 may include data store(s) 3516 (e.g., memory). The data store(s) 3516 may be on-chip memory of the SoC(s) 3504, which may store neural networks to be executed on the GPU and/or the DLA. In some examples, the data store(s) 3516 may be large enough in capacity to store multiple instances of neural networks for redundancy and safety. The data store(s) 3516 may comprise L2 or L3 cache(s) 3512. Reference to the data store(s) 3516 may include reference to the memory associated with the PVA, DLA, and/or other accelerator(s) 3514, as described herein.

The SoC(s) 3504 may include one or more processor(s) 3510 (e.g.,embedded processors). The processor(s) 3510 may include a boot and powermanagement processor that may be a dedicated processor and subsystem tohandle boot power and management functions and related securityenforcement. The boot and power management processor may be a part ofthe SoC(s) 3504 boot sequence and may provide runtime power managementservices. The boot power and management processor may provide clock andvoltage programming, assistance in system low power state transitions,management of SoC(s) 3504 thermals and temperature sensors, and/ormanagement of the SoC(s) 3504 power states. Each temperature sensor maybe implemented as a ring-oscillator whose output frequency isproportional to temperature, and the SoC(s) 3504 may use thering-oscillators to detect temperatures of the CPU(s) 3506, GPU(s) 3508,and/or accelerator(s) 3514. If temperatures are determined to exceed athreshold, the boot and power management processor may enter atemperature fault routine and put the SoC(s) 3504 into a lower powerstate and/or put the vehicle 3500 into a chauffeur to safe stop mode(e.g., bring the vehicle 3500 to a safe stop).
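For illustration, the sketch below models the thermal supervision flow: convert a ring-oscillator frequency reading to a temperature estimate and enter a fault routine above a threshold. The calibration constants and the threshold value are invented for the example.

# Illustrative sketch of the thermal supervision flow described above. The
# frequency-to-temperature calibration and the 100 C threshold are assumptions.
def ring_osc_to_celsius(freq_mhz: float,
                        f0_mhz: float = 500.0,     # assumed frequency at 25 C
                        mhz_per_c: float = 1.2) -> float:
    # Output frequency is modeled here as a simple affine function of temperature.
    return 25.0 + (freq_mhz - f0_mhz) / mhz_per_c

FAULT_THRESHOLD_C = 100.0   # assumed threshold

def supervise(freq_mhz: float) -> str:
    temp_c = ring_osc_to_celsius(freq_mhz)
    if temp_c > FAULT_THRESHOLD_C:
        # The boot and power management processor would enter a temperature fault
        # routine: lower the SoC power state and/or bring the vehicle to a safe stop.
        return f"FAULT: {temp_c:.1f} C -> enter lower power state / chauffeur to safe stop"
    return f"OK: {temp_c:.1f} C"

print(supervise(560.0))   # about 75 C -> OK
print(supervise(600.0))   # about 108 C -> fault routine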

The processor(s) 3510 may further include a set of embedded processorsthat may serve as an audio processing engine. The audio processingengine may be an audio subsystem that enables full hardware support formulti-channel audio over multiple interfaces, and a broad and flexiblerange of audio I/O interfaces. In some examples, the audio processingengine is a dedicated processor core with a digital signal processorwith dedicated RAM.

The processor(s) 3510 may further include an always on processor enginethat may provide necessary hardware features to support low power sensormanagement and wake use cases. The always on processor engine mayinclude a processor core, a tightly coupled RAM, supporting peripherals(e.g., timers and interrupt controllers), various I/O controllerperipherals, and routing logic.

The processor(s) 3510 may further include a safety cluster engine thatincludes a dedicated processor subsystem to handle safety management forautomotive applications. The safety cluster engine may include two ormore processor cores, a tightly coupled RAM, support peripherals (e.g.,timers, an interrupt controller, etc.), and/or routing logic. In asafety mode, the two or more cores may operate in a lockstep mode andfunction as a single core with comparison logic to detect anydifferences between their operations.

The processor(s) 3510 may further include a real-time camera engine thatmay include a dedicated processor subsystem for handling real-timecamera management.

The processor(s) 3510 may further include a high-dynamic range signalprocessor that may include an image signal processor that is a hardwareengine that is part of the camera processing pipeline.

The processor(s) 3510 may include a video image compositor that may be a processing block (e.g., implemented on a microprocessor) that implements video post-processing functions needed by a video playback application to produce the final image for the player window. The video image compositor may perform lens distortion correction on wide-view camera(s) 3570, surround camera(s) 3574, and/or on in-cabin monitoring camera sensors. An in-cabin monitoring camera sensor is preferably monitored by a neural network running on another instance of the Advanced SoC, configured to identify in-cabin events and respond accordingly. An in-cabin system may perform lip reading to activate cellular service and place a phone call, dictate emails, change the vehicle's destination, activate or change the vehicle's infotainment system and settings, or provide voice-activated web surfing. Certain functions are available to the driver only when the vehicle is operating in an autonomous mode, and are disabled otherwise.

The video image compositor may include enhanced temporal noise reductionfor both spatial and temporal noise reduction. For example, where motionoccurs in a video, the noise reduction weights spatial informationappropriately, decreasing the weight of information provided by adjacentframes. Where an image or portion of an image does not include motion,the temporal noise reduction performed by the video image compositor mayuse information from the previous image to reduce noise in the currentimage.
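The motion-adaptive weighting can be illustrated as follows: per pixel, the weight given to the previous frame falls as the frame-to-frame difference grows. The weighting function and constant in this sketch are assumptions, not the compositor's actual filter.

# Sketch of motion-adaptive temporal noise reduction: where a pixel differs
# little from the previous frame (low motion), lean on temporal averaging;
# where motion is large, favor the current frame. Constants are assumed.
import numpy as np

def temporal_nr(current: np.ndarray, previous: np.ndarray, k: float = 0.05) -> np.ndarray:
    """Blend current and previous frames per pixel based on motion magnitude."""
    motion = np.abs(current.astype(np.float32) - previous.astype(np.float32))
    temporal_weight = 1.0 / (1.0 + k * motion)          # high where motion is low
    blended = temporal_weight * previous + (1.0 - temporal_weight) * current
    return blended.astype(current.dtype)

prev_frame = np.full((4, 4), 100, dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[0, 0] = 160                                   # one "moving" pixel
print(temporal_nr(curr_frame, prev_frame))               # only the moving pixel departs from 100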

The video image compositor may also be configured to perform stereorectification on input stereo lens frames. The video image compositormay further be used for user interface composition when the operatingsystem desktop is in use, and the GPU(s) 3508 is not required tocontinuously render new surfaces. Even when the GPU(s) 3508 is poweredon and active doing 3D rendering, the video image compositor may be usedto offload the GPU(s) 3508 to improve performance and responsiveness.

The SoC(s) 3504 may further include a mobile industry processorinterface (MIPI) camera serial interface for receiving video and inputfrom cameras, a high-speed interface, and/or a video input block thatmay be used for camera and related pixel input functions. The SoC(s)3504 may further include an input/output controller(s) that may becontrolled by software and may be used for receiving I/O signals thatare uncommitted to a specific role.

The SoC(s) 3504 may further include a broad range of peripheralinterfaces to enable communication with peripherals, audio codecs, powermanagement, and/or other devices. The SoC(s) 3504 may be used to processdata from cameras (e.g., connected over Gigabit Multimedia Serial Linkand Ethernet), sensors (e.g., LIDAR sensor(s) 3564, RADAR sensor(s)3560, etc. that may be connected over Ethernet), data from bus 3502(e.g., speed of vehicle 3500, steering wheel position, etc.), data fromGNSS sensor(s) 3558 (e.g., connected over Ethernet or CAN bus). TheSoC(s) 3504 may further include dedicated high-performance mass storagecontrollers that may include their own DMA engines, and that may be usedto free the CPU(s) 3506 from routine data management tasks.

The SoC(s) 3504 may be an end-to-end platform with a flexiblearchitecture that spans automation levels 3-5, thereby providing acomprehensive functional safety architecture that leverages and makesefficient use of computer vision and ADAS techniques for diversity andredundancy, provides a platform for a flexible, reliable drivingsoftware stack, along with deep learning tools. The SoC(s) 3504 may befaster, more reliable, and even more energy-efficient andspace-efficient than conventional systems. For example, theaccelerator(s) 3514, when combined with the CPU(s) 3506, the GPU(s)3508, and the data store(s) 3516, may provide for a fast, efficientplatform for level 3-5 autonomous vehicles.

The technology thus provides capabilities and functionality that cannotbe achieved by conventional systems. For example, computer visionalgorithms may be executed on CPUs, which may be configured using a highlevel programming language, such as the C programming language, toexecute a wide variety of processing algorithms across a wide variety ofvisual data. However, CPUs are oftentimes unable to meet the performancerequirements of many computer vision applications, such as those relatedto execution time and power consumption, for example. In particular,many CPUs are unable to execute complex object detection algorithms inreal-time, which is a requirement of in-vehicle ADAS applications, and arequirement for practical Level 3-5 autonomous vehicles.

In contrast to conventional systems, by providing a CPU complex, GPU complex, and a hardware acceleration cluster, the technology described herein allows for multiple neural networks to be performed simultaneously and/or sequentially, and for the results to be combined together to enable Level 3-5 autonomous driving functionality. For example, a CNN executing on the DLA or dGPU (e.g., the GPU(s) 3520) may include text and word recognition, allowing the supercomputer to read and understand traffic signs, including signs for which the neural network has not been specifically trained. The DLA may further include a neural network that is able to identify, interpret, and provide semantic understanding of the sign, and to pass that semantic understanding to the path planning modules running on the CPU Complex.

As another example, multiple neural networks may be run simultaneously,as is required for Level 3, 4, or 5 driving. For example, a warning signconsisting of “Caution: flashing lights indicate icy conditions,” alongwith an electric light, may be independently or collectively interpretedby several neural networks. The sign itself may be identified as atraffic sign by a first deployed neural network (e.g., a neural networkthat has been trained), the text “Flashing lights indicate icyconditions” may be interpreted by a second deployed neural network,which informs the vehicle's path planning software (preferably executingon the CPU Complex) that when flashing lights are detected, icyconditions exist. The flashing light may be identified by operating athird deployed neural network over multiple frames, informing thevehicle's path-planning software of the presence (or absence) offlashing lights. All three neural networks may run simultaneously, suchas within the DLA and/or on the GPU(s) 3508.
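For illustration, the sketch below shows how the outputs of three such networks might be combined; the classifier functions are stubs and only the combination logic is shown.

# Hypothetical orchestration of the three deployed networks described above:
# one identifies the sign, one reads its text, one watches for the flashing
# light over multiple frames. The classifiers are stubs for this sketch.
from typing import List

def sign_detector(frame) -> bool:
    return True                      # stub: "this is a traffic sign"

def text_reader(frame) -> str:
    return "flashing lights indicate icy conditions"   # stub text interpretation

def flashing_light_detector(frames: List) -> bool:
    return True                      # stub: light observed flashing across frames

def interpret_warning(frames: List) -> str:
    latest = frames[-1]
    if not sign_detector(latest):
        return "no sign"
    conditional = "icy conditions" in text_reader(latest)
    flashing = flashing_light_detector(frames)
    if conditional and flashing:
        # Informs path-planning software that the conditional hazard is active.
        return "icy conditions active: adapt path planning"
    return "sign present, hazard not currently indicated"

print(interpret_warning(frames=[object(), object(), object()]))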

In some examples, a CNN for facial recognition and vehicle owneridentification may use data from camera sensors to identify the presenceof an authorized driver and/or owner of the vehicle 3500. The always onsensor processing engine may be used to unlock the vehicle when theowner approaches the driver door and turn on the lights, and, insecurity mode, to disable the vehicle when the owner leaves the vehicle.In this way, the SoC(s) 3504 provide for security against theft and/orcarjacking.

In another example, a CNN for emergency vehicle detection and identification may use data from microphones 3596 to detect and identify emergency vehicle sirens. In contrast to conventional systems that use general classifiers to detect sirens and manually extract features, the SoC(s) 3504 use the CNN for classifying environmental and urban sounds, as well as classifying visual data. In a preferred embodiment, the CNN running on the DLA is trained to identify the relative closing speed of the emergency vehicle (e.g., by using the Doppler effect). The CNN may also be trained to identify emergency vehicles specific to the local area in which the vehicle is operating, as identified by GNSS sensor(s) 3558. Thus, for example, when operating in Europe the CNN will seek to detect European sirens, and when in the United States the CNN will seek to identify only North American sirens. Once an emergency vehicle is detected, a control program may be used to execute an emergency vehicle safety routine, slowing the vehicle, pulling over to the side of the road, parking the vehicle, and/or idling the vehicle, with the assistance of ultrasonic sensors 3562, until the emergency vehicle(s) passes.
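One way to realize the region-specific behavior is to select the target siren classes from the GNSS-derived region before running the CNN, as in the hypothetical sketch below; the region lookup and class lists are placeholders.

# Sketch of selecting region-specific siren classes from GNSS position before
# running the siren CNN. The geofence and class lists are assumptions.
from typing import List

SIREN_CLASSES = {
    "EU": ["two-tone (hi-lo) siren"],
    "NA": ["wail siren", "yelp siren"],
}

def region_from_gnss(lat: float, lon: float) -> str:
    # Placeholder geofence: a real system would use a proper region lookup.
    return "EU" if -10.0 < lon < 40.0 else "NA"

def target_siren_classes(lat: float, lon: float) -> List[str]:
    return SIREN_CLASSES[region_from_gnss(lat, lon)]

print(target_siren_classes(48.1, 11.6))    # Munich -> European siren classes
print(target_siren_classes(37.4, -122.1))  # Palo Alto -> North American classes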

The vehicle may include a CPU(s) 3518 (e.g., discrete CPU(s), ordCPU(s)), that may be coupled to the SoC(s) 3504 via a high-speedinterconnect (e.g., PCIe). The CPU(s) 3518 may include an X86 processor,for example. The CPU(s) 3518 may be used to perform any of a variety offunctions, including arbitrating potentially inconsistent resultsbetween ADAS sensors and the SoC(s) 3504, and/or monitoring the statusand health of the controller(s) 3536 and/or infotainment SoC 3530, forexample.

The vehicle 3500 may include a GPU(s) 3520 (e.g., discrete GPU(s), ordGPU(s)), that may be coupled to the SoC(s) 3504 via a high-speedinterconnect (e.g., NVIDIA's NVLINK). The GPU(s) 3520 may provideadditional artificial intelligence functionality, such as by executingredundant and/or different neural networks, and may be used to trainand/or update neural networks based on input (e.g., sensor data) fromsensors of the vehicle 3500.

The vehicle 3500 may further include the network interface 3524 whichmay include one or more wireless antennas 3526 (e.g., one or morewireless antennas for different communication protocols, such as acellular antenna, a Bluetooth antenna, etc.). The network interface 3524may be used to enable wireless connectivity over the Internet with thecloud (e.g., with the server(s) 3578 and/or other network devices), withother vehicles, and/or with computing devices (e.g., client devices ofpassengers). To communicate with other vehicles, a direct link may beestablished between the two vehicles and/or an indirect link may beestablished (e.g., across networks and over the Internet). Direct linksmay be provided using a vehicle-to-vehicle communication link. Thevehicle-to-vehicle communication link may provide the vehicle 3500information about vehicles in proximity to the vehicle 3500 (e.g.,vehicles in front of, on the side of, and/or behind the vehicle 3500).This functionality may be part of a cooperative adaptive cruise controlfunctionality of the vehicle 3500.

The network interface 3524 may include a SoC that provides modulationand demodulation functionality and enables the controller(s) 3536 tocommunicate over wireless networks. The network interface 3524 mayinclude a radio frequency front-end for up-conversion from baseband toradio frequency, and down conversion from radio frequency to baseband.The frequency conversions may be performed through well-known processes,and/or may be performed using super-heterodyne processes. In someexamples, the radio frequency front end functionality may be provided bya separate chip. The network interface may include wirelessfunctionality for communicating over LTE, WCDMA, UMTS, GSM, CDMA2000,Bluetooth, Bluetooth LE, Wi-Fi, Z-Wave, ZigBee, LoRaWAN, and/or otherwireless protocols.

The vehicle 3500 may further include data store(s) 3528 which mayinclude off-chip (e.g., off the SoC(s) 3504) storage. The data store(s)3528 may include one or more storage elements including RAM, SRAM, DRAM,VRAM, Flash, hard disks, and/or other components and/or devices that maystore at least one bit of data.

The vehicle 3500 may further include GNSS sensor(s) 3558. The GNSS sensor(s) 3558 (e.g., GPS, assisted GPS sensors, differential GPS (DGPS) sensors, etc.) may be used to assist in mapping, perception, occupancy grid generation, and/or path planning functions. Any number of GNSS sensor(s) 3558 may be used, including, for example and without limitation, a GPS using a USB connector with an Ethernet to Serial (RS-232) bridge.

The vehicle 3500 may further include RADAR sensor(s) 3560. The RADAR sensor(s) 3560 may be used by the vehicle 3500 for long-range vehicle detection, even in darkness and/or severe weather conditions. RADAR functional safety levels may be ASIL B. The RADAR sensor(s) 3560 may use the CAN and/or the bus 3502 (e.g., to transmit data generated by the RADAR sensor(s) 3560) for control and to access object tracking data, with access to Ethernet to access raw data in some examples. A wide variety of RADAR sensor types may be used. For example, and without limitation, the RADAR sensor(s) 3560 may be suitable for front, rear, and side RADAR use. In some examples, Pulse Doppler RADAR sensor(s) are used.

The RADAR sensor(s) 3560 may include different configurations, such aslong range with narrow field of view, short range with wide field ofview, short range side coverage, etc. In some examples, long-range RADARmay be used for adaptive cruise control functionality. The long-rangeRADAR systems may provide a broad field of view realized by two or moreindependent scans, such as within a 250 m range. The RADAR sensor(s)3560 may help in distinguishing between static and moving objects, andmay be used by ADAS systems for emergency brake assist and forwardcollision warning. Long-range RADAR sensors may include monostaticmultimodal RADAR with multiple (e.g., six or more) fixed RADAR antennaeand a high-speed CAN and FlexRay interface. In an example with sixantennae, the central four antennae may create a focused beam pattern,designed to record the vehicle's 3500 surroundings at higher speeds withminimal interference from traffic in adjacent lanes. The other twoantennae may expand the field of view, making it possible to quicklydetect vehicles entering or leaving the vehicle's 3500 lane.

Mid-range RADAR systems may include, as an example, a range of up to 160 m (front) or 80 m (rear), and a field of view of up to 42 degrees (front) or 150 degrees (rear). Short-range RADAR systems may include, without limitation, RADAR sensors designed to be installed at both ends of the rear bumper. When installed at both ends of the rear bumper, such RADAR sensor systems may create two beams that constantly monitor the blind spot in the rear and next to the vehicle.

Short-range RADAR systems may be used in an ADAS system for blind spotdetection and/or lane change assist.

The vehicle 3500 may further include ultrasonic sensor(s) 3562. Theultrasonic sensor(s) 3562, which may be positioned at the front, back,and/or the sides of the vehicle 3500, may be used for park assist and/orto create and update an occupancy grid. A wide variety of ultrasonicsensor(s) 3562 may be used, and different ultrasonic sensor(s) 3562 maybe used for different ranges of detection (e.g., 2.5 m, 4 m). Theultrasonic sensor(s) 3562 may operate at functional safety levels ofASIL B.

The vehicle 3500 may include LIDAR sensor(s) 3564. The LIDAR sensor(s)3564 may be used for object and pedestrian detection, emergency braking,collision avoidance, and/or other functions. The LIDAR sensor(s) 3564may be functional safety level ASIL B. In some examples, the vehicle3500 may include multiple LIDAR sensors 3564 (e.g., two, four, six,etc.) that may use Ethernet (e.g., to provide data to a Gigabit Ethernetswitch).

In some examples, the LIDAR sensor(s) 3564 may be capable of providing a list of objects and their distances for a 360° field of view. Commercially available LIDAR sensor(s) 3564 may have an advertised range of approximately 100 m, with an accuracy of 2 cm-3 cm, and with support for a 100 Mbps Ethernet connection, for example. In some examples, one or more non-protruding LIDAR sensors 3564 may be used. In such examples, the LIDAR sensor(s) 3564 may be implemented as a small device that may be embedded into the front, rear, sides, and/or corners of the vehicle 3500. The LIDAR sensor(s) 3564, in such examples, may provide up to a 120-degree horizontal and 35-degree vertical field-of-view, with a 200 m range even for low-reflectivity objects. Front-mounted LIDAR sensor(s) 3564 may be configured for a horizontal field of view between 45 degrees and 135 degrees.

In some examples, LIDAR technologies, such as 3D flash LIDAR, may alsobe used. 3D Flash LIDAR uses a flash of a laser as a transmissionsource, to illuminate vehicle surroundings up to approximately 200 m. Aflash LIDAR unit includes a receptor, which records the laser pulsetransit time and the reflected light on each pixel, which in turncorresponds to the range from the vehicle to the objects. Flash LIDARmay allow for highly accurate and distortion-free images of thesurroundings to be generated with every laser flash. In some examples,four flash LIDAR sensors may be deployed, one at each side of thevehicle 3500. Available 3D flash LIDAR systems include a solid-state 3Dstaring array LIDAR camera with no moving parts other than a fan (e.g.,a non-scanning LIDAR device). The flash LIDAR device may use a 5nanosecond class I (eye-safe) laser pulse per frame and may capture thereflected laser light in the form of 3D range point clouds andco-registered intensity data. By using flash LIDAR, and because flashLIDAR is a solid-state device with no moving parts, the LIDAR sensor(s)3564 may be less susceptible to motion blur, vibration, and/or shock.
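
For illustration, the per-pixel relationship between recorded pulse transit time and range noted above may be expressed as range = c * t / 2 (half the round trip at the speed of light); the following sketch and its names are illustrative assumptions only.

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def transit_time_to_range_m(round_trip_time_s):
    """Convert a recorded laser pulse round-trip transit time to a range in meters."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0

# Example: a round trip of about 1.33 microseconds corresponds to roughly 200 m.
print(transit_time_to_range_m(1.334e-6))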

The vehicle may further include IMU sensor(s) 3566. The IMU sensor(s)3566 may be located at a center of the rear axle of the vehicle 3500, insome examples. The IMU sensor(s) 3566 may include, for example andwithout limitation, an accelerometer(s), a magnetometer(s), agyroscope(s), a magnetic compass(es), and/or other sensor types. In someexamples, such as in six-axis applications, the IMU sensor(s) 3566 mayinclude accelerometers and gyroscopes, while in nine-axis applications,the IMU sensor(s) 3566 may include accelerometers, gyroscopes, andmagnetometers.

In some embodiments, the IMU sensor(s) 3566 may be implemented as aminiature, high performance GPS-Aided Inertial Navigation System(GPS/INS) that combines micro-electro-mechanical systems (MEMS) inertialsensors, a high-sensitivity GPS receiver, and advanced Kalman filteringalgorithms to provide estimates of position, velocity, and attitude. Assuch, in some examples, the IMU sensor(s) 3566 may enable the vehicle3500 to estimate heading without requiring input from a magnetic sensorby directly observing and correlating the changes in velocity from GPSto the IMU sensor(s) 3566. In some examples, the IMU sensor(s) 3566 andthe GNSS sensor(s) 3558 may be combined in a single integrated unit.
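
As a non-limiting sketch of how heading may be estimated from GPS velocity without a magnetic sensor, the following computes heading from north and east velocity components; the function name and sign conventions are illustrative assumptions.

import math

def heading_from_gps_velocity_deg(v_north_mps, v_east_mps):
    """Heading in degrees clockwise from north, derived from GPS velocity components."""
    return math.degrees(math.atan2(v_east_mps, v_north_mps)) % 360.0

# Example: equal north and east velocity corresponds to a 45 degree (northeast) heading.
print(heading_from_gps_velocity_deg(10.0, 10.0))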

The vehicle may include microphone(s) 3596 placed in and/or around thevehicle 3500. The microphone(s) 3596 may be used for emergency vehicledetection and identification, among other things.

The vehicle may further include any number of camera types, including stereo camera(s) 3568, wide-view camera(s) 3570, infrared camera(s) 3572, surround camera(s) 3574, long-range and/or mid-range camera(s) 3598, and/or other camera types. The cameras may be used to capture image data around an entire periphery of the vehicle 3500. The types of cameras used depend on the embodiment and requirements for the vehicle 3500, and any combination of camera types may be used to provide the necessary coverage around the vehicle 3500. In addition, the number of cameras may differ depending on the embodiment. For example, the vehicle may include six cameras, seven cameras, ten cameras, twelve cameras, and/or another number of cameras. The cameras may support, as an example and without limitation, Gigabit Multimedia Serial Link (GMSL) and/or Gigabit Ethernet. Each of the camera(s) is described with more detail herein with respect to FIG. 35A and FIG. 35B.

The vehicle 3500 may further include vibration sensor(s) 3542. Thevibration sensor(s) 3542 may measure vibrations of components of thevehicle, such as the axle(s). For example, changes in vibrations mayindicate a change in road surfaces. In another example, when two or morevibration sensors 3542 are used, the differences between the vibrationsmay be used to determine friction or slippage of the road surface (e.g.,when the difference in vibration is between a power-driven axle and afreely rotating axle).
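
As a non-limiting illustration of comparing vibration levels between a power-driven axle and a freely rotating axle, the following sketch flags possible slippage when the driven axle vibrates noticeably more; the threshold and names are illustrative assumptions rather than any particular embodiment.

import numpy as np

def possible_slippage(driven_axle_signal, free_axle_signal, threshold=0.3):
    """Return True when the driven axle vibration exceeds the free axle vibration by a margin."""
    driven_rms = float(np.sqrt(np.mean(np.square(driven_axle_signal))))
    free_rms = float(np.sqrt(np.mean(np.square(free_axle_signal))))
    return (driven_rms - free_rms) / max(free_rms, 1e-6) > threshold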

The vehicle 3500 may include an ADAS system 3538. The ADAS system 3538may include a SoC, in some examples. The ADAS system 3538 may includeautonomous/adaptive/automatic cruise control (ACC), cooperative adaptivecruise control (CACC), forward crash warning (FCW), automatic emergencybraking (AEB), lane departure warnings (LDW), lane keep assist (LKA),blind spot warning (BSW), rear cross-traffic warning (RCTW), collisionwarning systems (CWS), lane centering (LC), and/or other features andfunctionality.

The ACC systems may use RADAR sensor(s) 3560, LIDAR sensor(s) 3564, and/or a camera(s). The ACC systems may include longitudinal ACC and/or lateral ACC. Longitudinal ACC monitors and controls the distance to the vehicle immediately ahead of the vehicle 3500 and automatically adjusts the vehicle speed to maintain a safe distance from vehicles ahead. Lateral ACC performs distance keeping, and advises the vehicle 3500 to change lanes when necessary. Lateral ACC is related to other ADAS applications such as LCA and CWS.
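
As a non-limiting sketch of longitudinal distance keeping, the following proportional adjustment tracks a time-gap-based following distance; the gain, time gap, and function name are illustrative assumptions, not the controller of any particular embodiment.

def longitudinal_acc_speed_mps(ego_speed_mps, gap_to_lead_m, time_gap_s=2.0, gain=0.5):
    """Return an adjusted speed command that tracks a desired time-gap following distance."""
    desired_gap_m = ego_speed_mps * time_gap_s
    gap_error_m = gap_to_lead_m - desired_gap_m  # positive when there is room to speed up
    return max(0.0, ego_speed_mps + gain * gap_error_m / max(time_gap_s, 0.1))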

CACC uses information from other vehicles that may be received via the network interface 3524 and/or the wireless antenna(s) 3526 from other vehicles via a wireless link, or indirectly, over a network connection (e.g., over the Internet). Direct links may be provided by a vehicle-to-vehicle (V2V) communication link, while indirect links may be provided by an infrastructure-to-vehicle (I2V) communication link. In general, the V2V communication concept provides information about the immediately preceding vehicles (e.g., vehicles immediately ahead of and in the same lane as the vehicle 3500), while the I2V communication concept provides information about traffic further ahead. CACC systems may include either or both I2V and V2V information sources. Given the information of the vehicles ahead of the vehicle 3500, CACC may be more reliable and it has the potential to improve traffic flow smoothness and reduce congestion on the road.

FCW systems are designed to alert the driver to a hazard, so that thedriver may take corrective action. FCW systems use a front-facing cameraand/or RADAR sensor(s) 3560, coupled to a dedicated processor, DSP,FPGA, and/or ASIC, that is electrically coupled to driver feedback, suchas a display, speaker, and/or vibrating component. FCW systems mayprovide a warning, such as in the form of a sound, visual warning,vibration and/or a quick brake pulse.

AEB systems detect an impending forward collision with another vehicle or other object, and may automatically apply the brakes if the driver does not take corrective action within a specified time or distance parameter. AEB systems may use front-facing camera(s) and/or RADAR sensor(s) 3560, coupled to a dedicated processor, DSP, FPGA, and/or ASIC. When the AEB system detects a hazard, it typically first alerts the driver to take corrective action to avoid the collision and, if the driver does not take corrective action, the AEB system may automatically apply the brakes in an effort to prevent, or at least mitigate, the impact of the predicted collision. AEB systems may include techniques such as dynamic brake support and/or crash imminent braking.
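
As a non-limiting sketch, the "specified time or distance parameter" above may be expressed as a time-to-collision or minimum-gap check; the thresholds and names below are illustrative assumptions.

def aeb_should_brake(gap_m, closing_speed_mps, ttc_threshold_s=1.5, min_gap_m=2.0):
    """Trigger automatic braking when time-to-collision or remaining distance is too small."""
    if gap_m <= min_gap_m:
        return True
    if closing_speed_mps <= 0.0:
        return False  # not closing on the object ahead
    time_to_collision_s = gap_m / closing_speed_mps
    return time_to_collision_s < ttc_threshold_s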

LDW systems provide visual, audible, and/or tactile warnings, such as steering wheel or seat vibrations, to alert the driver when the vehicle 3500 crosses lane markings. An LDW system does not activate when the driver indicates an intentional lane departure, such as by activating a turn signal. LDW systems may use front-side facing cameras, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component.

LKA systems are a variation of LDW systems. LKA systems provide steeringinput or braking to correct the vehicle 3500 if the vehicle 3500 startsto exit the lane.

BSW systems detect and warn the driver of vehicles in an automobile's blind spot. BSW systems may provide a visual, audible, and/or tactile alert to indicate that merging or changing lanes is unsafe. The system may provide an additional warning when the driver uses a turn signal. BSW systems may use rear-side facing camera(s) and/or RADAR sensor(s) 3560, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component.

RCTW systems may provide visual, audible, and/or tactile notificationwhen an object is detected outside the rear-camera range when thevehicle 3500 is backing up. Some RCTW systems include AEB to ensure thatthe vehicle brakes are applied to avoid a crash. RCTW systems may useone or more rear-facing RADAR sensor(s) 3560, coupled to a dedicatedprocessor, DSP, FPGA, and/or ASIC, that is electrically coupled todriver feedback, such as a display, speaker, and/or vibrating component.

Conventional ADAS systems may be prone to false positive results which may be annoying and distracting to a driver, but typically are not catastrophic, because the ADAS systems alert the driver and allow the driver to decide whether a safety condition truly exists and act accordingly. However, in an autonomous vehicle 3500, the vehicle 3500 itself must, in the case of conflicting results, decide whether to heed the result from a primary computer or a secondary computer (e.g., a first controller 3536 or a second controller 3536). For example, in some embodiments, the ADAS system 3538 may be a backup and/or secondary computer for providing perception information to a backup computer rationality module. The backup computer rationality monitor may run redundant, diverse software on hardware components to detect faults in perception and dynamic driving tasks. Outputs from the ADAS system 3538 may be provided to a supervisory MCU. If outputs from the primary computer and the secondary computer conflict, the supervisory MCU must determine how to reconcile the conflict to ensure safe operation.

In some examples, the primary computer may be configured to provide thesupervisory MCU with a confidence score, indicating the primarycomputer's confidence in the chosen result. If the confidence scoreexceeds a threshold, the supervisory MCU may follow the primarycomputer's direction, regardless of whether the secondary computerprovides a conflicting or inconsistent result. Where the confidencescore does not meet the threshold, and where the primary and secondarycomputer indicate different results (e.g., the conflict), thesupervisory MCU may arbitrate between the computers to determine theappropriate outcome.
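
As a non-limiting sketch of this arbitration, the following follows the primary computer when its confidence clears a threshold and otherwise reconciles conflicting results; the data structures, threshold value, and fallback rule are illustrative assumptions rather than the arbitration of any particular embodiment.

from dataclasses import dataclass

@dataclass
class ComputerOutput:
    result: str        # e.g., "brake" or "continue"
    confidence: float  # 0.0 .. 1.0

def arbitrate(primary: ComputerOutput, secondary: ComputerOutput, threshold: float = 0.8) -> str:
    """Reconcile primary and secondary computer outputs at a supervisory MCU."""
    if primary.confidence >= threshold:
        return primary.result  # follow the primary computer regardless of the secondary
    if primary.result == secondary.result:
        return primary.result  # no conflict to reconcile
    # Conflict with low primary confidence: defer to whichever output is more confident.
    return secondary.result if secondary.confidence > primary.confidence else primary.result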

The supervisory MCU may be configured to run a neural network(s) that istrained and configured to determine, based on outputs from the primarycomputer and the secondary computer, conditions under which thesecondary computer provides false alarms. Thus, the neural network(s) inthe supervisory MCU may learn when the secondary computer's output maybe trusted, and when it cannot. For example, when the secondary computeris a RADAR-based FCW system, a neural network(s) in the supervisory MCUmay learn when the FCW system is identifying metallic objects that arenot, in fact, hazards, such as a drainage grate or manhole cover thattriggers an alarm. Similarly, when the secondary computer is acamera-based LDW system, a neural network in the supervisory MCU maylearn to override the LDW when bicyclists or pedestrians are present anda lane departure is, in fact, the safest maneuver. In embodiments thatinclude a neural network(s) running on the supervisory MCU, thesupervisory MCU may include at least one of a DLA or GPU suitable forrunning the neural network(s) with associated memory. In preferredembodiments, the supervisory MCU may comprise and/or be included as acomponent of the SoC(s) 3504.

In other examples, ADAS system 3538 may include a secondary computer that performs ADAS functionality using traditional rules of computer vision. As such, the secondary computer may use classic computer vision rules (if-then), and the presence of a neural network(s) in the supervisory MCU may improve reliability, safety and performance. For example, the diverse implementation and intentional non-identity make the overall system more fault-tolerant, especially to faults caused by software (or software-hardware interface) functionality. For example, if there is a software bug or error in the software running on the primary computer, and the non-identical software code running on the secondary computer provides the same overall result, the supervisory MCU may have greater confidence that the overall result is correct, and the bug in software or hardware on the primary computer is not causing a material error.

In some examples, the output of the ADAS system 3538 may be fed into theprimary computer's perception block and/or the primary computer'sdynamic driving task block. For example, if the ADAS system 3538indicates a forward crash warning due to an object immediately ahead,the perception block may use this information when identifying objects.In other examples, the secondary computer may have its own neuralnetwork which is trained and thus reduces the risk of false positives,as described herein.

The vehicle 3500 may further include the infotainment SoC 3530 (e.g., an in-vehicle infotainment system (IVI)). Although illustrated and described as a SoC, the infotainment system may not be a SoC, and may include two or more discrete components. The infotainment SoC 3530 may include a combination of hardware and software that may be used to provide audio (e.g., music, a personal digital assistant, navigational instructions, news, radio, etc.), video (e.g., TV, movies, streaming, etc.), phone (e.g., hands-free calling), network connectivity (e.g., LTE, Wi-Fi, etc.), and/or information services (e.g., navigation systems, rear-parking assistance, a radio data system, vehicle related information such as fuel level, total distance covered, brake fluid level, oil level, door open/close, air filter information, etc.) to the vehicle 3500. For example, the infotainment SoC 3530 may include radios, disk players, navigation systems, video players, USB and Bluetooth connectivity, carputers, in-car entertainment, Wi-Fi, steering wheel audio controls, hands free voice control, a heads-up display (HUD), an HMI display 3534, a telematics device, a control panel (e.g., for controlling and/or interacting with various components, features, and/or systems), and/or other components. The infotainment SoC 3530 may further be used to provide information (e.g., visual and/or audible) to a user(s) of the vehicle, such as information from the ADAS system 3538, autonomous driving information such as planned vehicle maneuvers, trajectories, surrounding environment information (e.g., intersection information, vehicle information, road information, etc.), and/or other information.

The infotainment SoC 3530 may include GPU functionality. Theinfotainment SoC 3530 may communicate over the bus 3502 (e.g., CAN bus,Ethernet, etc.) with other devices, systems, and/or components of thevehicle 3500. In some examples, the infotainment SoC 3530 may be coupledto a supervisory MCU such that the GPU of the infotainment system mayperform some self-driving functions in the event that the primarycontroller(s) 3536 (e.g., the primary and/or backup computers of thevehicle 3500) fail. In such an example, the infotainment SoC 3530 mayput the vehicle 3500 into a chauffeur to safe stop mode, as describedherein.

The vehicle 3500 may further include an instrument cluster 3532 (e.g., adigital dash, an electronic instrument cluster, a digital instrumentpanel, etc.). The instrument cluster 3532 may include a controllerand/or supercomputer (e.g., a discrete controller or supercomputer). Theinstrument cluster 3532 may include a set of instrumentation such as aspeedometer, fuel level, oil pressure, tachometer, odometer, turnindicators, gearshift position indicator, seat belt warning light(s),parking-brake warning light(s), engine-malfunction light(s), airbag(SRS) system information, lighting controls, safety system controls,navigation information, etc. In some examples, information may bedisplayed and/or shared among the infotainment SoC 3530 and theinstrument cluster 3532. In other words, the instrument cluster 3532 maybe included as part of the infotainment SoC 3530, or vice versa.

FIG. 35D is a system diagram for communication between cloud-basedserver(s) and the example autonomous vehicle 3500 of FIG. 35A, inaccordance with some embodiments of the present disclosure. The system3576 may include server(s) 3578, network(s) 3590, and vehicles,including the vehicle 3500. The server(s) 3578 may include a pluralityof GPUs 3584(A)-3584(H) (collectively referred to herein as GPUs 3584),PCIe switches 3582(A)-3582(H) (collectively referred to herein as PCIeswitches 3582), and/or CPUs 3580(A)-3580(B) (collectively referred toherein as CPUs 3580). The GPUs 3584, the CPUs 3580, and the PCIeswitches may be interconnected with high-speed interconnects such as,for example and without limitation, NVLink interfaces 3588 developed byNVIDIA and/or PCIe connections 3586. In some examples, the GPUs 3584 areconnected via NVLink and/or NVSwitch SoC and the GPUs 3584 and the PCIeswitches 3582 are connected via PCIe interconnects. Although eight GPUs3584, two CPUs 3580, and two PCIe switches are illustrated, this is notintended to be limiting. Depending on the embodiment, each of theserver(s) 3578 may include any number of GPUs 3584, CPUs 3580, and/orPCIe switches. For example, the server(s) 3578 may each include eight,sixteen, thirty-two, and/or more GPUs 3584.

The server(s) 3578 may receive, over the network(s) 3590 and from thevehicles, image data representative of images showing unexpected orchanged road conditions, such as recently commenced road-work. Theserver(s) 3578 may transmit, over the network(s) 3590 and to thevehicles, neural networks 3592, updated neural networks 3592, and/or mapinformation 3594, including information regarding traffic and roadconditions. The updates to the map information 3594 may include updatesfor the HD map 3522, such as information regarding construction sites,potholes, detours, flooding, and/or other obstructions. In someexamples, the neural networks 3592, the updated neural networks 3592,and/or the map information 3594 may have resulted from new trainingand/or experiences represented in data received from any number ofvehicles in the environment, and/or based on training performed at adatacenter (e.g., using the server(s) 3578 and/or other servers).

The server(s) 3578 may be used to train machine learning models (e.g., neural networks) based on training data. The training data may be generated by the vehicles, and/or may be generated in a simulation (e.g., using a game engine). In some examples, the training data is tagged (e.g., where the neural network benefits from supervised learning) and/or undergoes other pre-processing, while in other examples the training data is not tagged and/or pre-processed (e.g., where the neural network does not require supervised learning). Training may be executed according to any one or more classes of machine learning techniques, including, without limitation, classes such as: supervised training, semi-supervised training, unsupervised training, self-learning, reinforcement learning, federated learning, transfer learning, feature learning (including principal component and cluster analyses), multi-linear subspace learning, manifold learning, representation learning (including sparse dictionary learning), rule-based machine learning, anomaly detection, and any variants or combinations thereof. Once the machine learning models are trained, the machine learning models may be used by the vehicles (e.g., transmitted to the vehicles over the network(s) 3590), and/or the machine learning models may be used by the server(s) 3578 to remotely monitor the vehicles.

In some examples, the server(s) 3578 may receive data from the vehicles and apply the data to up-to-date real-time neural networks for real-time intelligent inferencing. The server(s) 3578 may include deep-learning supercomputers and/or dedicated AI computers powered by GPU(s) 3584, such as DGX and DGX Station machines developed by NVIDIA. However, in some examples, the server(s) 3578 may include deep learning infrastructure that uses only CPU-powered datacenters.

The deep-learning infrastructure of the server(s) 3578 may be capable offast, real-time inferencing, and may use that capability to evaluate andverify the health of the processors, software, and/or associatedhardware in the vehicle 3500. For example, the deep-learninginfrastructure may receive periodic updates from the vehicle 3500, suchas a sequence of images and/or objects that the vehicle 3500 has locatedin that sequence of images (e.g., via computer vision and/or othermachine learning object classification techniques). The deep-learninginfrastructure may run its own neural network to identify the objectsand compare them with the objects identified by the vehicle 3500 and, ifthe results do not match and the infrastructure concludes that the AI inthe vehicle 3500 is malfunctioning, the server(s) 3578 may transmit asignal to the vehicle 3500 instructing a fail-safe computer of thevehicle 3500 to assume control, notify the passengers, and complete asafe parking maneuver.
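
As a non-limiting sketch of this remote verification, the following compares per-frame object sets reported by the vehicle with those the infrastructure identifies itself and flags a mismatch; the agreement metric, threshold, and names are illustrative assumptions.

def perception_is_healthy(vehicle_objects_per_frame, server_objects_per_frame, min_agreement=0.9):
    """Return True when the vehicle's and the server's per-frame object sets agree often enough."""
    if not vehicle_objects_per_frame:
        return True
    matches = sum(
        1 for vehicle_set, server_set in zip(vehicle_objects_per_frame, server_objects_per_frame)
        if vehicle_set == server_set
    )
    return matches / len(vehicle_objects_per_frame) >= min_agreement

# A False result could prompt the server(s) 3578 to signal the fail-safe computer as described above.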

For inferencing, the server(s) 3578 may include the GPU(s) 3584 and oneor more programmable inference accelerators (e.g., NVIDIA's TensorRT).The combination of GPU-powered servers and inference acceleration maymake real-time responsiveness possible. In other examples, such as whereperformance is less critical, servers powered by CPUs, FPGAs, and otherprocessors may be used for inferencing.

Example Computing Device

FIG. 36 is a block diagram of an example computing device(s) 3600suitable for use in implementing some embodiments of the presentdisclosure. Computing device 3600 may include an interconnect system3602 that directly or indirectly couples the following devices: memory3604, one or more central processing units (CPUs) 3606, one or moregraphics processing units (GPUs) 3608, a communication interface 3610,input/output (I/O) ports 3612, input/output components 3614, a powersupply 3616, one or more presentation components 3618 (e.g.,display(s)), and one or more logic units 3620. In at least oneembodiment, the computing device(s) 3600 may comprise one or morevirtual machines (VMs), and/or any of the components thereof maycomprise virtual components (e.g., virtual hardware components). Fornon-limiting examples, one or more of the GPUs 3608 may comprise one ormore vGPUs, one or more of the CPUs 3606 may comprise one or more vCPUs,and/or one or more of the logic units 3620 may comprise one or morevirtual logic units. As such, a computing device(s) 3600 may includediscrete components (e.g., a full GPU dedicated to the computing device3600), virtual components (e.g., a portion of a GPU dedicated to thecomputing device 3600), or a combination thereof.

Although the various blocks of FIG. 36 are shown as connected via theinterconnect system 3602 with lines, this is not intended to be limitingand is for clarity only. For example, in some embodiments, apresentation component 3618, such as a display device, may be consideredan I/O component 3614 (e.g., if the display is a touch screen). Asanother example, the CPUs 3606 and/or GPUs 3608 may include memory(e.g., the memory 3604 may be representative of a storage device inaddition to the memory of the GPUs 3608, the CPUs 3606, and/or othercomponents). In other words, the computing device of FIG. 36 is merelyillustrative. Distinction is not made between such categories as“workstation,” “server,” “laptop,” “desktop,” “tablet,” “client device,”“mobile device,” “hand-held device,” “game console,” “electronic controlunit (ECU),” “virtual reality system,” and/or other device or systemtypes, as all are contemplated within the scope of the computing deviceof FIG. 36 .

The interconnect system 3602 may represent one or more links or busses,such as an address bus, a data bus, a control bus, or a combinationthereof. The interconnect system 3602 may include one or more bus orlink types, such as an industry standard architecture (ISA) bus, anextended industry standard architecture (EISA) bus, a video electronicsstandards association (VESA) bus, a peripheral component interconnect(PCI) bus, a peripheral component interconnect express (PCIe) bus,and/or another type of bus or link. In some embodiments, there aredirect connections between components. As an example, the CPU 3606 maybe directly connected to the memory 3604. Further, the CPU 3606 may bedirectly connected to the GPU 3608. Where there is direct, orpoint-to-point connection between components, the interconnect system3602 may include a PCIe link to carry out the connection. In theseexamples, a PCI bus need not be included in the computing device 3600.

The memory 3604 may include any of a variety of computer-readable media.The computer-readable media may be any available media that may beaccessed by the computing device 3600. The computer-readable media mayinclude both volatile and nonvolatile media, and removable andnon-removable media. By way of example, and not limitation, thecomputer-readable media may comprise computer-storage media andcommunication media.

The computer-storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the memory 3604 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system). Computer-storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store the desired information and which may be accessed by computing device 3600. As used herein, computer storage media does not comprise signals per se.

The communication media may embody computer-readable instructions, data structures, program modules, and/or other data types in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” may refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, the communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.

The CPU(s) 3606 may be configured to execute at least some of thecomputer-readable instructions to control one or more components of thecomputing device 3600 to perform one or more of the methods and/orprocesses described herein. The CPU(s) 3606 may each include one or morecores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.)that are capable of handling a multitude of software threadssimultaneously. The CPU(s) 3606 may include any type of processor, andmay include different types of processors depending on the type ofcomputing device 3600 implemented (e.g., processors with fewer cores formobile devices and processors with more cores for servers). For example,depending on the type of computing device 3600, the processor may be anAdvanced RISC Machines (ARM) processor implemented using ReducedInstruction Set Computing (RISC) or an x86 processor implemented usingComplex Instruction Set Computing (CISC). The computing device 3600 mayinclude one or more CPUs 3606 in addition to one or more microprocessorsor supplementary co-processors, such as math co-processors.

In addition to or alternatively from the CPU(s) 3606, the GPU(s) 3608 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 3600 to perform one or more of the methods and/or processes described herein. One or more of the GPU(s) 3608 may be an integrated GPU (e.g., with one or more of the CPU(s) 3606) and/or one or more of the GPU(s) 3608 may be a discrete GPU. In embodiments, one or more of the GPU(s) 3608 may be a coprocessor of one or more of the CPU(s) 3606. The GPU(s) 3608 may be used by the computing device 3600 to render graphics (e.g., 3D graphics) or perform general purpose computations. For example, the GPU(s) 3608 may be used for General-Purpose computing on GPUs (GPGPU). The GPU(s) 3608 may include hundreds or thousands of cores that are capable of handling hundreds or thousands of software threads simultaneously. The GPU(s) 3608 may generate pixel data for output images in response to rendering commands (e.g., rendering commands from the CPU(s) 3606 received via a host interface). The GPU(s) 3608 may include graphics memory, such as display memory, for storing pixel data or any other suitable data, such as GPGPU data. The display memory may be included as part of the memory 3604. The GPU(s) 3608 may include two or more GPUs operating in parallel (e.g., via a link). The link may directly connect the GPUs (e.g., using NVLINK) or may connect the GPUs through a switch (e.g., using NVSwitch). When combined together, each GPU 3608 may generate pixel data or GPGPU data for different portions of an output or for different outputs (e.g., a first GPU for a first image and a second GPU for a second image). Each GPU may include its own memory, or may share memory with other GPUs.

In addition to or alternatively from the CPU(s) 3606 and/or the GPU(s)3608, the logic unit(s) 3620 may be configured to execute at least someof the computer-readable instructions to control one or more componentsof the computing device 3600 to perform one or more of the methodsand/or processes described herein. In embodiments, the CPU(s) 3606, theGPU(s) 3608, and/or the logic unit(s) 3620 may discretely or jointlyperform any combination of the methods, processes and/or portionsthereof. One or more of the logic units 3620 may be part of and/orintegrated in one or more of the CPU(s) 3606 and/or the GPU(s) 3608and/or one or more of the logic units 3620 may be discrete components orotherwise external to the CPU(s) 3606 and/or the GPU(s) 3608. Inembodiments, one or more of the logic units 3620 may be a coprocessor ofone or more of the CPU(s) 3606 and/or one or more of the GPU(s) 3608.

Examples of the logic unit(s) 3620 include one or more processing coresand/or components thereof, such as Data Processing Units (DPUs), TensorCores (TCs), Tensor Processing Units (TPUs), Pixel Visual Cores (PVCs),Vision Processing Units (VPUs), Graphics Processing Clusters (GPCs),Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs),Tree Traversal Units (TTUs), Artificial Intelligence Accelerators(AIAs), Deep Learning Accelerators (DLAs), Arithmetic-Logic Units(ALUs), Application-Specific Integrated Circuits (ASICs), Floating PointUnits (FPUs), input/output (I/O) elements, peripheral componentinterconnect (PCI) or peripheral component interconnect express (PCIe)elements, and/or the like.

The communication interface 3610 may include one or more receivers, transmitters, and/or transceivers that enable the computing device 3600 to communicate with other computing devices via an electronic communication network, including wired and/or wireless communications. The communication interface 3610 may include components and functionality to enable communication over any of a number of different networks, such as wireless networks (e.g., Wi-Fi, Z-Wave, Bluetooth, Bluetooth LE, ZigBee, etc.), wired networks (e.g., communicating over Ethernet or InfiniBand), low-power wide-area networks (e.g., LoRaWAN, SigFox, etc.), and/or the Internet. In one or more embodiments, logic unit(s) 3620 and/or communication interface 3610 may include one or more data processing units (DPUs) to transmit data received over a network and/or through interconnect system 3602 directly to (e.g., a memory of) one or more GPU(s) 3608.

The I/O ports 3612 may enable the computing device 3600 to be logically coupled to other devices including the I/O components 3614, the presentation component(s) 3618, and/or other components, some of which may be built in to (e.g., integrated in) the computing device 3600. Illustrative I/O components 3614 include a microphone, mouse, keyboard, joystick, game pad, game controller, satellite dish, scanner, printer, wireless device, etc. The I/O components 3614 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs may be transmitted to an appropriate network element for further processing. An NUI may implement any combination of speech recognition, stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition (as described in more detail below) associated with a display of the computing device 3600. The computing device 3600 may include depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, touchscreen technology, and combinations of these, for gesture detection and recognition. Additionally, the computing device 3600 may include accelerometers or gyroscopes (e.g., as part of an inertial measurement unit (IMU)) that enable detection of motion. In some examples, the output of the accelerometers or gyroscopes may be used by the computing device 3600 to render immersive augmented reality or virtual reality.

The power supply 3616 may include a hard-wired power supply, a batterypower supply, or a combination thereof. The power supply 3616 mayprovide power to the computing device 3600 to enable the components ofthe computing device 3600 to operate.

The presentation component(s) 3618 may include a display (e.g., amonitor, a touch screen, a television screen, a heads-up-display (HUD),other display types, or a combination thereof), speakers, and/or otherpresentation components. The presentation component(s) 3618 may receivedata from other components (e.g., the GPU(s) 3608, the CPU(s) 3606,DPUs, etc.), and output the data (e.g., as an image, video, sound,etc.).

The techniques disclosed herein may be incorporated in any processorthat may be used for processing a neural network, such as, for example,a central processing unit (CPU), a GPU, an intelligence processing unit(IPU), neural processing unit (NPU), tensor processing unit (TPU), aneural network processor (NNP), a data processing unit (DPU), a visionprocessing unit (VPU), an application-specific integrated circuit(ASIC), a field-programmable gate array (FPGA), and the like. Such aprocessor may be incorporated in a personal computer (e.g., a laptop),at a data center, in an Internet of Things (IoT) device, a handhelddevice (e.g., smartphone), a vehicle, a robot, a voice-controlleddevice, or any other device that performs inference, training or anyother processing of a neural network. Such a processor may be employedin a virtualized system such that an operating system executing in avirtual machine on the system can utilize the processor.

As an example, a processor incorporating the techniques disclosed hereincan be employed to process one or more neural networks in a machine toidentify, classify, manipulate, handle, operate, modify, or navigatearound physical objects in the real world. For example, such a processormay be employed in an autonomous vehicle (e.g., an automobile,motorcycle, helicopter, drone, plane, boat, submarine, delivery robot,etc.) to move the vehicle through the real world. Additionally, such aprocessor may be employed in a robot at a factory to select componentsand assemble components into an assembly.

As an example, a processor incorporating the techniques disclosed hereincan be employed to process one or more neural networks to identify oneor more features in an image or alter, generate, or compress an image.For example, such a processor may be employed to enhance an image thatis rendered using raster, ray-tracing (e.g., using NVIDIA RTX), and/orother rendering techniques. In another example, such a processor may beemployed to reduce the amount of image data that is transmitted over anetwork (e.g., the Internet, a mobile telecommunications network, a WIFInetwork, as well as any other wired or wireless networking system) froma rendering device to a display device. Such transmissions may beutilized to stream image data from a server or a data center in thecloud to a user device (e.g., a personal computer, video game console,smartphone, other mobile devices, etc.) to enhance services that streamimages such as NVIDIA GeForce Now (GFN), Google Stadia, and the like.

As an example, a processor incorporating the techniques disclosed hereincan be employed to process one or more neural networks for any othertypes of applications that can take advantage of a neural network. Forexample, such applications may involve translating languages,identifying and negating sounds in audio, detecting anomalies or defectsduring the production of goods and services, surveillance of livingbeings and non-living things, medical diagnosis, making decisions, andthe like.

Example Data Center

FIG. 37 illustrates an example data center 3700 that may be used in at least one embodiment of the present disclosure. The data center 3700 may include a data center infrastructure layer 3710, a framework layer 3720, a software layer 3730, and/or an application layer 3740.

As shown in FIG. 37 , the data center infrastructure layer 3710 may include a resource orchestrator 3712, grouped computing resources 3714, and node computing resources (“node C.R.s”) 3716(1)-3716(N), where “N” represents any whole, positive integer. In at least one embodiment, node C.R.s 3716(1)-3716(N) may include, but are not limited to, any number of central processing units (CPUs) or other processors (including DPUs, accelerators, field programmable gate arrays (FPGAs), graphics processors or graphics processing units (GPUs), etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid state or disk drives), network input/output (NW I/O) devices, network switches, virtual machines (VMs), power modules, and/or cooling modules, etc. In some embodiments, one or more node C.R.s from among node C.R.s 3716(1)-3716(N) may correspond to a server having one or more of the above-mentioned computing resources. In addition, in some embodiments, the node C.R.s 3716(1)-3716(N) may include one or more virtual components, such as vGPUs, vCPUs, and/or the like, and/or one or more of the node C.R.s 3716(1)-3716(N) may correspond to a virtual machine (VM).

In at least one embodiment, grouped computing resources 3714 may includeseparate groupings of node C.R.s 3716 housed within one or more racks(not shown), or many racks housed in data centers at variousgeographical locations (also not shown). Separate groupings of nodeC.R.s 3716 within grouped computing resources 3714 may include groupedcompute, network, memory or storage resources that may be configured orallocated to support one or more workloads. In at least one embodiment,several node C.R.s 3716 including CPUs, GPUs, DPUs, and/or otherprocessors may be grouped within one or more racks to provide computeresources to support one or more workloads. The one or more racks mayalso include any number of power modules, cooling modules, and/ornetwork switches, in any combination.

The resource orchestrator 3712 may configure or otherwise control one ormore node C.R.s 3716(1)-3716(N) and/or grouped computing resources 3714.In at least one embodiment, resource orchestrator 3712 may include asoftware design infrastructure (SDI) management entity for the datacenter 3700. The resource orchestrator 3712 may include hardware,software, or some combination thereof.

In at least one embodiment, as shown in FIG. 37 , framework layer 3720may include a job scheduler 3733, a configuration manager 3734, aresource manager 3736, and/or a distributed file system 3738. Theframework layer 3720 may include a framework to support software 3732 ofsoftware layer 3730 and/or one or more application(s) 3742 ofapplication layer 3740. The software 3732 or application(s) 3742 mayrespectively include web-based service software or applications, such asthose provided by Amazon Web Services, Google Cloud and Microsoft Azure.The framework layer 3720 may be, but is not limited to, a type of freeand open-source software web application framework such as Apache Spark™(hereinafter “Spark”) that may utilize distributed file system 3738 forlarge-scale data processing (e.g., “big data”). In at least oneembodiment, job scheduler 3733 may include a Spark driver to facilitatescheduling of workloads supported by various layers of data center 3700.The configuration manager 3734 may be capable of configuring differentlayers such as software layer 3730 and framework layer 3720 includingSpark and distributed file system 3738 for supporting large-scale dataprocessing. The resource manager 3736 may be capable of managingclustered or grouped computing resources mapped to or allocated forsupport of distributed file system 3738 and job scheduler 3733. In atleast one embodiment, clustered or grouped computing resources mayinclude grouped computing resource 3714 at data center infrastructurelayer 3710. The resource manager 3736 may coordinate with resourceorchestrator 3712 to manage these mapped or allocated computingresources.

In at least one embodiment, software 3732 included in software layer3730 may include software used by at least portions of node C.R.s3716(1)-3716(N), grouped computing resources 3714, and/or distributedfile system 3738 of framework layer 3720. One or more types of softwaremay include, but are not limited to, Internet web page search software,e-mail virus scan software, database software, and streaming videocontent software.

In at least one embodiment, application(s) 3742 included in applicationlayer 3740 may include one or more types of applications used by atleast portions of node C.R.s 3716(1)-3716(N), grouped computingresources 3714, and/or distributed file system 3738 of framework layer3720. One or more types of applications may include, but are not limitedto, any number of a genomics application, a cognitive compute, and amachine learning application, including training or inferencingsoftware, machine learning framework software (e.g., PyTorch,TensorFlow, Caffe, etc.), and/or other machine learning applicationsused in conjunction with one or more embodiments.

In at least one embodiment, any of configuration manager 3734, resourcemanager 3736, and resource orchestrator 3712 may implement any numberand type of self-modifying actions based on any amount and type of dataacquired in any technically feasible fashion. Self-modifying actions mayrelieve a data center operator of data center 3700 from making possiblybad configuration decisions and possibly avoiding underutilized and/orpoor performing portions of a data center.

The data center 3700 may include tools, services, software or otherresources to train one or more machine learning models or predict orinfer information using one or more machine learning models according toone or more embodiments described herein. For example, a machinelearning model(s) may be trained by calculating weight parametersaccording to a neural network architecture using software and/orcomputing resources described above with respect to the data center3700. In at least one embodiment, trained or deployed machine learningmodels corresponding to one or more neural networks may be used to inferor predict information using resources described above with respect tothe data center 3700 by using weight parameters calculated through oneor more training techniques, such as but not limited to those describedherein.
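
As a non-limiting sketch of calculating weight parameters according to a neural network architecture, the following minimal PyTorch loop performs a few gradient updates; the architecture, data, and hyperparameters are placeholders for illustration and are not taken from this disclosure.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(64, 16)           # placeholder training batch
targets = torch.randint(0, 2, (64,))   # placeholder labels

for _ in range(10):                    # a few illustrative weight updates
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()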

In at least one embodiment, the data center 3700 may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, and/or other hardware (or virtual compute resources corresponding thereto) to perform training and/or inferencing using the above-described resources. Moreover, one or more software and/or hardware resources described above may be configured as a service to allow users to train or perform inferencing of information, such as image recognition, speech recognition, or other artificial intelligence services.

Example Network Environments

Network environments suitable for use in implementing embodiments of thedisclosure may include one or more client devices, servers, networkattached storage (NAS), other backend devices, and/or other devicetypes. The client devices, servers, and/or other device types (e.g.,each device) may be implemented on one or more instances of thecomputing device(s) 3600 of FIG. 36 —e.g., each device may includesimilar components, features, and/or functionality of the computingdevice(s) 3600. In addition, where backend devices (e.g., servers, NAS,etc.) are implemented, the backend devices may be included as part of adata center 3700, an example of which is described in more detail hereinwith respect to FIG. 37 .

Components of a network environment may communicate with each other viaa network(s), which may be wired, wireless, or both. The network mayinclude multiple networks, or a network of networks. By way of example,the network may include one or more Wide Area Networks (WANs), one ormore Local Area Networks (LANs), one or more public networks such as theInternet and/or a public switched telephone network (PSTN), and/or oneor more private networks. Where the network includes a wirelesstelecommunications network, components such as a base station, acommunications tower, or even access points (as well as othercomponents) may provide wireless connectivity.

Compatible network environments may include one or more peer-to-peernetwork environments—in which case a server may not be included in anetwork environment—and one or more client-server networkenvironments—in which case one or more servers may be included in anetwork environment. In peer-to-peer network environments, functionalitydescribed herein with respect to a server(s) may be implemented on anynumber of client devices.

In at least one embodiment, a network environment may include one or more cloud-based network environments, a distributed computing environment, a combination thereof, etc. A cloud-based network environment may include a framework layer, a job scheduler, a resource manager, and a distributed file system implemented on one or more of servers, which may include one or more core network servers and/or edge servers. A framework layer may include a framework to support software of a software layer and/or one or more application(s) of an application layer. The software or application(s) may respectively include web-based service software or applications. In embodiments, one or more of the client devices may use the web-based service software or applications (e.g., by accessing the service software and/or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, a type of free and open-source software web application framework, such as one that may use a distributed file system for large-scale data processing (e.g., “big data”).

A cloud-based network environment may provide cloud computing and/orcloud storage that carries out any combination of computing and/or datastorage functions described herein (or one or more portions thereof).Any of these various functions may be distributed over multiplelocations from central or core servers (e.g., of one or more datacenters that may be distributed across a state, a region, a country, theglobe, etc.). If a connection to a user (e.g., a client device) isrelatively close to an edge server(s), a core server(s) may designate atleast a portion of the functionality to the edge server(s). Acloud-based network environment may be private (e.g., limited to asingle organization), may be public (e.g., available to manyorganizations), and/or a combination thereof (e.g., a hybrid cloudenvironment).

The client device(s) may include at least some of the components,features, and functionality of the example computing device(s) 3600described herein with respect to FIG. 36 . By way of example and notlimitation, a client device may be embodied as a Personal Computer (PC),a laptop computer, a mobile device, a smartphone, a tablet computer, asmart watch, a wearable computer, a Personal Digital Assistant (PDA), anMP3 player, a virtual reality headset, a Global Positioning System (GPS)or device, a video player, a video camera, a surveillance device orsystem, a vehicle, a boat, a flying vessel, a virtual machine, a drone,a robot, a handheld communications device, a hospital device, a gamingdevice or system, an entertainment system, a vehicle computer system, anembedded system controller, a remote control, an appliance, a consumerelectronic device, a workstation, an edge device, any combination ofthese delineated devices, or any other suitable device.

The disclosure may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The disclosure may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, etc. The disclosure may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.

As used herein, a recitation of “and/or” with respect to two or moreelements should be interpreted to mean only one element, or acombination of elements. For example, “element A, element B, and/orelement C” may include only element A, only element B, only element C,element A and element B, element A and element C, element B and elementC, or elements A, B, and C. In addition, “at least one of element A orelement B” may include at least one of element A, at least one ofelement B, or at least one of element A and at least one of element B.Further, “at least one of element A and element B” may include at leastone of element A, at least one of element B, or at least one of elementA and at least one of element B.

The subject matter of the present disclosure is described withspecificity herein to meet statutory requirements. However, thedescription itself is not intended to limit the scope of thisdisclosure. Rather, the inventors have contemplated that the claimedsubject matter might also be embodied in other ways, to includedifferent steps or combinations of steps similar to the ones describedin this document, in conjunction with other present or futuretechnologies. Moreover, although the terms “step” and/or “block” may beused herein to connote different elements of methods employed, the termsshould not be interpreted as implying any particular order among orbetween various steps herein disclosed unless and except when the orderof individual steps is explicitly described.

What is claimed is:
 1. A method comprising: generating, using sensordata captured during a first time slice, two or more image framesrepresentative of two or more at least partially overlapping viewpointsaround an ego-object in an environment; updating a candidate positionfor a seam to an updated position based at least on an intersection ofthe seam at the candidate position with one or more pixels of one ormore detected objects in at least one projected mask of the two or moreprojected masks; and generating a composite image frame based at leaston stitching the two or more image frames using the updated position ofthe seam.
 2. The method of claim 1, wherein the updating of thecandidate position for the seam to the updated position comprises:generating two or more projected masks comprising at least partiallyoverlapping representations of one or more detected objects depicted inthe two or more image frames; determining a candidate position for aseam in an overlapping region of the two or more image frames; andupdating the candidate position for the seam to an updated positionbased at least on an intersection of the seam at the candidate positionwith one or more pixels of the one or more detected objects in at leastone projected mask of the two or more projected masks.
 3. The method of claim 1, wherein the updating of the candidate position for the seam to the updated position causes the updated position to be at least partially horizontal or at least partially vertical.
 4. The method ofclaim 1, further comprising generating the two or more projected masksbased at least on projecting two or more corresponding binary objectmasks onto an overlapping portion of the two or more image frames, thetwo or more binary object masks being indicative of whether or not eachpixel in a corresponding one of the two or more image frames correspondsto one of the one or more detected objects.
 5. The method of claim 1,further comprising generating the two or more projected masks based atleast on projecting two or more corresponding weighted saliency masksonto an overlapping portion of the two or more image frames, the two ormore weighted saliency masks representing a measure of saliency of eachpixel in a corresponding one of the two or more image frames.
 6. Themethod of claim 1, further comprising generating the two or moreprojected masks based at least on omitting a subset of an initial set ofdetected objects determined to be beyond a threshold distance from theego-object.
 7. The method of claim 1, further comprising generating thetwo or more projected masks based at least on projecting two or morecorresponding weighted saliency masks, prioritizing one or more classesof the one or more detected objects, onto an overlapping portion of thetwo or more image frames.
 8. The method of claim 1, further comprisinggenerating the two or more projected masks based at least on projectingtwo or more corresponding weighted saliency masks, weightingcorresponding binary object masks based at least on proximity to the oneor more detected objects, onto an overlapping portion of the two or moreimage frames.
 9. The method of claim 1, wherein the ego-object is a medical probe, and the two or more viewpoints comprise at least one of a multi-view or a proximity view corresponding to the medical probe.
 10. The method of claim 1, wherein the method is performed by at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
 11. A processor comprising: one or more circuits to: obtain image data of two or more image frames corresponding to two or more separate viewpoints that share at least a portion of an area in an environment; detect one or more objects depicted in the two or more image frames; determine a candidate position for a seam in an aligned representation of the two or more image frames; update the candidate position of the seam to an updated position based at least on an intersection of the seam at the candidate position with one or more pixels in the aligned representation that at least partially depict the one or more objects; and generate a composite image based at least on stitching the two or more image frames using the updated position of the seam.
12. The processor of claim 11, the one or more circuits further to update the candidate position for the seam to the updated position based at least on reducing a number of the one or more pixels in the aligned representation of the two or more image frames that at least partially depict the one or more objects.
13. The processor of claim 11, the one or more circuits further to update the candidate position for the seam to the updated position corresponding to an at least partially horizontal position or an at least partially vertical position.
14. The processor of claim 11, the one or more circuits further to determine whether the seam at the candidate position intersects the one or more pixels in the aligned representation that at least partially depict the one or more objects based at least on comparing the candidate position of the seam with corresponding pixels in two or more projected masks representing positions of the one or more detected objects in the aligned representation.
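Illustrative sketch (not part of the claims): the comparison of the kind recited in claim 14 could rasterize a candidate seam into pixel coordinates and count how many of them land on salient pixels in either of two projected masks that share the aligned overlap's pixel grid. Representing the seam as a list of (row, column) pairs is an assumption made for this example.

```python
import numpy as np

def seam_object_intersections(seam_pixels, mask_a: np.ndarray, mask_b: np.ndarray) -> int:
    """Count seam pixels that land on detected-object pixels in either
    projected mask.

    seam_pixels: iterable of (row, col) coordinates along the candidate seam.
    """
    combined = (mask_a > 0) | (mask_b > 0)   # salient in either source image
    rows, cols = np.asarray(list(seam_pixels)).T
    return int(combined[rows, cols].sum())

# A seam intersecting zero salient pixels needs no update; otherwise the
# candidate position would be shifted, e.g., via a column search as above.
```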
15. The processor of claim 11, wherein the processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
16. A system comprising: one or more processing units to update a candidate position for a seam in an overlapping region of two or more image frames based at least on an intersection between an initial candidate position of the seam and one or more pixels corresponding to one or more detected objects of a designated object class that are at least partially depicted in the overlapping region, and to generate a composite image based at least on stitching the two or more image frames using a seam at the updated candidate position.
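Illustrative sketch (not part of the claims): restricting the seam update to a designated object class, as in claim 16, could be done by keeping only detections of the designated classes when forming the mask used for the update. The detection structure and class names are assumptions made for this example.

```python
import numpy as np

# Assumed set of designated classes whose pixels the seam must avoid.
DESIGNATED_CLASSES = {"pedestrian", "bicyclist"}

def designated_class_mask(shape, detections):
    """Union of the binary masks of detections whose class is designated.

    detections: iterable of dicts with a boolean "mask" array of `shape`
    and a "class" name.
    """
    mask = np.zeros(shape, dtype=bool)
    for det in detections:
        if det["class"] in DESIGNATED_CLASSES:
            mask |= det["mask"]
    return mask
```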
17. The system of claim 16, wherein the one or more processing units update the candidate position for the seam by generating two or more projected masks corresponding to the one or more detected objects based at least on projecting two or more corresponding weighted saliency masks onto an overlapping portion of the two or more image frames.
18. The system of claim 17, wherein the two or more weighted saliency masks represent a measure of saliency of at least one pixel in a corresponding one of the two or more image frames.
19. The system of claim 17, wherein projecting the two or more corresponding weighted saliency masks onto an overlapping portion of the two or more image frames comprises prioritizing one or more object classes of the one or more detected objects.
20. The system of claim 16, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for generating synthetic data; or a system implemented at least partially using cloud computing resources.